Natural Language Processing App Gains Increased Vector Precision

March 1, 2016

For us, concepts have meaning in relation to other concepts, but it is easier for computers to define concepts in terms of usage statistics. The post “Sense2vec with spaCy and Gensim” from spaCy’s blog offers a well-written outline of how natural language processing works, highlighting the team’s new sense2vec app. The application is an upgrade of word2vec that works with more context-sensitive word vectors. The article describes how sense2vec achieves this added precision:

“The idea behind sense2vec is super simple. If the problem is that duck as in waterfowl and duck as in crouch are different concepts, the straight-forward solution is to just have two entries, duckN and duckV. We’ve wanted to try this for some time. So when Trask et al (2015) published a nice set of experiments showing that the idea worked well, we were easy to convince.

We follow Trask et al in adding part-of-speech tags and named entity labels to the tokens. Additionally, we merge named entities and base noun phrases into single tokens, so that they receive a single vector.”

Curious about the meta definition of natural language processing from spaCy, we queried “natural language processing” using sense2vec. Its neural network was trained on every word posted to Reddit in 2015. While it is a feat for NLP to learn from a dataset drawn from one platform, such as Reddit, what about processing that scours multiple data sources?
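To make the quoted recipe concrete, here is a minimal sketch of sense2vec-style preprocessing, assuming spaCy and Gensim (4.x) are installed; the corpus file name, model parameters, and query key are our illustrative stand-ins, not the spaCy team’s actual pipeline.

import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_sm")

def sense_tokens(text):
    # Merge base noun phrases into single tokens, then tag every token with
    # its part of speech so that duck|NOUN and duck|VERB stay distinct.
    doc = nlp(text)
    with doc.retokenize() as retokenizer:
        for np in list(doc.noun_chunks):
            retokenizer.merge(np)
    return [tok.text.replace(" ", "_") + "|" + tok.pos_
            for tok in doc if not tok.is_space]

# Hypothetical sample file; the real model trained on Reddit's 2015 comments.
corpus = [sense_tokens(line) for line in open("reddit_2015_sample.txt")]
model = Word2Vec(corpus, vector_size=128, window=5, min_count=5)

# Query a multi-word sense key (assuming it survived the min_count cutoff).
print(model.wv.most_similar("natural_language_processing|NOUN", topn=5))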

 

Megan Feil, March 1, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Elasticsearch Works for Us 24/7

February 5, 2016

Elasticsearch is one of the most popular open source search applications, deployed for personal as well as corporate use.  It is built on another popular open source project, Apache Lucene, and was designed for horizontal scalability, reliability, and ease of use.  Elasticsearch has become such an invaluable piece of software that people do not realize just how often they rely on it.  eWeek takes the opportunity to discuss the search application’s uses in “9 Ways Elasticsearch Helps Us, From Dawn To Dusk.”

“With more than 45 million downloads since 2012, the Elastic Stack, which includes Elasticsearch and other popular open-source tools like Logstash (data collection), Kibana (data visualization) and Beats (data shippers) makes it easy for developers to make massive amounts of structured, unstructured and time-series data available in real-time for search, logging, analytics and other use cases.”

How is Elasticsearch being used?  The Guardian relies on it daily to let readers interact with content, Microsoft Dynamics ERP and CRM use it to index and analyze social feeds, it powers search at Yelp, and, here is a big one, Wikimedia uses it to power the well-loved and much-used Wikipedia.  We can already see how much of an impact Elasticsearch makes on our daily lives without our being aware of it.  Other companies that use Elasticsearch for our and their benefit are HotelTonight, Dell, Groupon, Quizlet, and Netflix.

Elasticsearch will continue to grow as an inexpensive alternative to proprietary software, and the number of Web services and companies that use it will only increase.
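For readers curious what such deployments look like at the code level, here is a minimal indexing-and-search sketch using the official Elasticsearch Python client (8.x API); the index name, document, and local cluster URL are illustrative assumptions, not details from the eWeek piece.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index an article, roughly the way a publisher indexes content.
es.index(index="articles", id="1", document={
    "headline": "Elasticsearch Works for Us 24/7",
    "body": "Open source search built on Apache Lucene.",
})
es.indices.refresh(index="articles")

# Run a full-text query against the indexed content.
hits = es.search(index="articles", query={"match": {"body": "lucene"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["headline"], hit["_score"])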

Whitney Grace, February 5, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

The Enterprise and Online Anonymity Networks

February 3, 2016

An article from Computer World UK entitled “Tor and the enterprise 2016 – blocking malware, darknet use and rogue nodes” discusses the inevitable enterprise concerns related to anonymity networks. Tor, The Onion Router, has gained steam with mainstream Internet users over the last five years. According to the article,

“It’s not hard to understand that Tor has plenty of perfectly legitimate uses (it is not our intention to stigmatise its use) but it also has plenty of troubling ones such as connecting to criminal sites on the ‘darknet’, as a channel for malware and as a way of bypassing network security. The anxiety for organisations is that it is impossible to tell which is which. Tor is not the only anonymity network designed with ultra-security in mind, The Invisible Internet Project (I2P) being another example. On top of this, VPNs and proxies also create similar risks although these are much easier to spot and block.”

The conclusion this article draws is that technology can only take the enterprise so far in mitigating risk. Its suggestion is to rely on penalties for running unauthorized applications, but this seems a short-sighted solution if the popularity of anonymity networks continues to rise.
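As a rough illustration of the detection side of the problem, the sketch below checks an address against the Tor Project’s published bulk exit-node list. The approach and the sample IP are our illustration, not a recommendation from the article, and it only catches known exit relays, not bridges, I2P, or VPNs.

import urllib.request

# The Tor Project publishes a plain-text list of known exit-node IPs.
EXIT_LIST_URL = "https://check.torproject.org/torbulkexitlist"

def load_tor_exits():
    with urllib.request.urlopen(EXIT_LIST_URL) as resp:
        lines = resp.read().decode().splitlines()
    return {line.strip() for line in lines if line.strip()}

tor_exits = load_tor_exits()

def is_tor_exit(ip):
    return ip in tor_exits

print(is_tor_exit("203.0.113.7"))  # documentation-range IP; expect False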

 

Megan Feil, February 3, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Measuring Classifiers by a Rule of Thumb

February 1, 2016

Computer programmers who specialize in machine learning, artificial intelligence, data mining, data visualization, and statistics are smart individuals, but even they sometimes get stumped.  Using the same form of communication as Reddit and old-fashioned forums, Cross Validated is a question-and-answer site run by Stack Exchange.   People can post questions about data and related topics and then wait for a response.  One user posted a question about “Machine Learning Classifiers”:

“I have been trying to find a good summary for the usage of popular classifiers, kind of like rules of thumb for when to use which classifier. For example, if there are lots of features, if there are millions of samples, if there are streaming samples coming in, etc., which classifier would be better suited in which scenarios?”

The response the user received was that the question was too broad.  Which classifier performs best depends on the data and the process that generates it.  It is kind of like asking for the best way to organize books or taxes: it depends on the content of the items in question.

Another user replied that there is an easy way to convey the general process of choosing a classifier, and directed readers to scikit-learn’s “Choosing the right estimator” chart. Other users say that the chart is incomplete, because it does not include deep learning, decision trees, or logistic regression.
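The “it depends on the data” answer is easy to demonstrate: cross-validate a few scikit-learn classifiers on the same dataset and compare scores. The dataset and candidate models below are our illustrative picks, not the thread’s.

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
candidates = {
    "logistic regression": LogisticRegression(max_iter=2000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "linear SVM": LinearSVC(dual=False),
}
# Which classifier "wins" changes with the data and the process behind it.
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, "mean accuracy:", round(scores.mean(), 3))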

We say create some other diagrams and share those.  Classifiers are complex, but they are a necessity for the artificial intelligence and big data craze.

 

Whitney Grace, February 1, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Data Discrimination Is Real

January 22, 2016

One of the best things about data and numbers is that they do not lie…usually.  According to Slate’s article, “FTC Report Details How Big Data Can Discriminate Against The Poor,” big data does a huge disservice to people of lower socioeconomic status by reinforcing existing negative patterns.  The Federal Trade Commission (FTC), academics, and activists have warned for some time that big data analytics can deepen those patterns:

“At its worst, big data can reinforce—and perhaps even amplify—existing disparities, partly because predictive technologies tend to recycle existing patterns instead of creating new openings. They can be especially dangerous when they inform decisions about people’s access to healthcare, credit, housing, and more. For instance, some data suggests that those who live close to their workplaces are likely to maintain their employment for longer. If companies decided to take that into account when hiring, it could be accidentally discriminatory because of the racialized makeup of some neighborhoods.”

The FTC stresses that big data analytics has positive benefits as well.  It can yield information that creates more job opportunities, transforms health care delivery, extends credit through “non-traditional” methods, and more.

The way big data can avoid reinforcing these problems, and even improve on them, is to account for biases from the beginning.  Large data sets can render these problems invisible or even harder to recognize.  Companies can use prejudiced data to justify the actions they take and even to weaken the effectiveness of consumer choice.

Data is supposed to be an objective tool, but the sources behind the data can be questionable.  It becomes important for third parties and the companies themselves to investigate the data sources, run multiple tests, and confirm that the data is truly objective.  Otherwise we will be dealing with social problems, and more, reinforced by bad data.
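One concrete example of the “multiple tests” such an audit might run is the four-fifths (disparate impact) check, sketched below with made-up numbers; the 0.8 threshold comes from U.S. employment guidance, and neither the FTC report nor Slate prescribes this exact test.

def disparate_impact(selected_a, total_a, selected_b, total_b):
    # Ratio of group A's selection rate to group B's; below 0.8 flags possible bias.
    rate_a = selected_a / total_a
    rate_b = selected_b / total_b
    return rate_a / rate_b

# Illustrative numbers only: 30% of group A selected vs. 60% of group B.
ratio = disparate_impact(selected_a=30, total_a=100, selected_b=60, total_b=100)
print("impact ratio:", round(ratio, 2), "-> flag for review" if ratio < 0.8 else "-> ok")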

Whitney Grace, January 22, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

The Ins and Outs of Hacking Software

December 23, 2015

Hacking software is a problem now and could become a bigger one.  While some government agencies, hacktivist organizations, and software companies are trying to use it for good, terrorist groups, digital thieves, and even law enforcement agencies can use it to spy on individuals and steal their data.  Technology Review shares some interesting stories about how the software is being used for benign and harmful purposes in “The Growth Industry Helping Governments Hack Terrorists, Criminals, And Political Opponents.”

The company Hacking Team is discussed at length, along with its Remote Control System software, which can worm its way through security holes in a device and steal valuable information.  Governments from around the globe have used the software for crime deterrence and to keep tabs on enemies, but other entities have used it for harmful acts, including spying on and hacking into political opponents’ computers.

Within the United States, it is illegal to use a Remote Control System without proper authority, but this often happens:

“When police get access to new surveillance technologies, they are often quickly deployed before any sort of oversight is in place to regulate their use. In the United States, the abuse of Stingrays—devices that sweep up information from cell phones in a given area—has become common. For example, the sheriff of San Bernardino County, near Los Angeles, deployed them over 300 times without a warrant in the space of less than two years. That problem is only being addressed now, years after it emerged, with the FBI now requiring a warrant to use Stingrays, and efforts underway to force local law enforcement to do the same. It’s easy to imagine a similar pattern of abuse with hacking tools, which are far more powerful and invasive than other surveillance technologies that police currently use.”

It is scary how the software is being used and how governments are skirting their own laws to use it.  It reminds me of how gun control is always a controversial topic.  Whenever there is a mass shooting, debates rage about how the shooting would never have happened if stricter gun control had kept weapons out of the hands of psychopaths.  While the shooter is blamed for the incident, people also place a lot of blame on the gun, as if it were more responsible.  As spying, control, and other software becomes more powerful and ingrained in our lives, I imagine there will be debates about “software control” and determining who has the right to use certain programs.

Whitney Grace, December 23, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Topology Is Finally on Top

December 21, 2015

Topology’s time has finally come, according to “The Unreasonable Usefulness of Imagining You Live in a Rubbery World,” shared by 3 Quarks Daily. The engaging article reminds us that the field of topology emphasizes connections over geometric factors like distance and direction. Think of a subway map as compared to a street map; or, as writer Jonathan Kujawa describes:

“Topologists ask a question which at first sounds ridiculous: ‘What can you say about the shape of an object if you have no concern for lengths, angles, areas, or volumes?’ They imagine a world where everything is made of silly putty. You can bend, stretch, and distort objects as much as you like. What is forbidden is cutting and gluing. Otherwise pretty much anything goes.”

Since the beginning, this perspective has been dismissed by many as purely academic. However, today’s era of networks and big data has boosted the field’s usefulness. The article observes:

“A remarkable new application of topology has emerged in the last few years. Gunnar Carlsson is a mathematician at Stanford who uses topology to extract meaningful information from large data sets. He and others invented a new field of mathematics called Topological data analysis. They use the tools of topology to wrangle huge data sets. In addition to the networks mentioned above, Big Data has given us Brobdingnagian-sized data sets in which, for example, we would like to be able to identify clusters. We might be able to visually identify clusters if the data points depend on only one or two variables so that they can be drawn in two or three dimensions.”

Kujawa goes on to note that one century-old tool of topology, homology, is being used to analyze real-world data, like the ways diabetes patients have responded to a specific medication. See the well-illustrated article for further discussion.
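To see the simplest of these tools in action, consider zero-dimensional homology: counting connected components, which is exactly how clusters show up in topological data analysis. The toy sketch below links any two points closer than a chosen scale and counts the resulting clusters; the points and the scale are made up for illustration.

from math import dist
from itertools import combinations

points = [(0, 0), (0.2, 0.1), (0.1, 0.3), (5, 5), (5.1, 4.8)]
scale = 1.0  # connect points closer than this distance

# Union-find over points linked when within `scale` of each other.
parent = list(range(len(points)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

for i, j in combinations(range(len(points)), 2):
    if dist(points[i], points[j]) <= scale:
        parent[find(i)] = find(j)

components = {find(i) for i in range(len(points))}
print(len(components), "clusters at scale", scale)  # expect 2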

Cynthia Murrell, December 21, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

The Modern Law Firm and Data

December 16, 2015

We thought it was a problem when law enforcement officials did not know how the Internet and Dark Web worked, much less the capabilities of eDiscovery tools, but a law firm that does not know how to work with data-mining tools, or grasp the importance of technology, is losing credibility, profit, and evidence for cases.  According to Information Week in “Data, Lawyers, And IT: How They’re Connected,” the modern law firm needs to be aware of how eDiscovery tools, predictive coding, and data science work, and to see how they can benefit its cases.

It can be daunting to understand how new technology works, especially in a law firm.  The article explains how the above tools and more work in four key segments: what role data plays before trial, how it is changing the courtroom, how new tools pave the way for unprecedented approaches to law practice, and how data is improving the way law firms operate.

Data in pretrial amounts to one word: evidence.  People live their lives via their computers and create a digital trail without realizing it.  With a few eDiscovery tools, lawyers can assemble all the necessary information within hours.  Data tools in the courtroom make practicing law seem like a scenario out of a fantasy or science fiction novel.  Lawyers are able to immediately pull up information to use as evidence for cross-examination or to validate facts.  New eDiscovery tools are also useful because they allow lawyers to prepare their arguments based on the judge and jury pool.  More data is available on individual cases, rather than just big-name ones.

“The legal industry has historically been a technology laggard, but it is evolving rapidly to meet the requirements of a data-intensive world.

‘Years ago, document review was done by hand. Metadata didn’t exist. You didn’t know when a document was created, who authored it, or who changed it. eDiscovery and computers have made dealing with massive amounts of data easier,’ said Robb Helt, director of trial technology at Suann Ingle Associates.”
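As a small illustration of the metadata Helt mentions, here is a minimal sketch that harvests filesystem metadata for a review set; the folder name is a hypothetical stand-in, and real eDiscovery platforms also parse in-document fields such as an Office file’s author.

from datetime import datetime
from pathlib import Path

def file_metadata(path):
    # Basic filesystem metadata: size and last-modified timestamp.
    st = path.stat()
    return {
        "file": str(path),
        "size_bytes": st.st_size,
        "modified": datetime.fromtimestamp(st.st_mtime).isoformat(),
    }

# Hypothetical folder of collected case documents.
for p in Path("case_documents").rglob("*"):
    if p.is_file():
        print(file_metadata(p))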

Legal eDiscovery is one of the branches of big data that has skyrocketed in the past decade.  While the examples discussed here are employed by respected law firms, keep in mind that eDiscovery technology is still new.  Ambulance chasers and other small law firms probably do not have a full IT squad on staff, so when evaluating lawyers, ask about their eDiscovery capabilities.

Whitney Grace, December 16, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Content Matching Helps Police Bust Dark Web Sex Trafficking Ring

September 4, 2015

The Dark Web is not only used to buy and sell illegal drugs; it is also used to perpetuate sex trafficking, especially of children.  The work of law enforcement agencies to prevent the abuse of sex trafficking victims is detailed in a report by the Australian Broadcasting Corporation called “Secret ‘Dark Net’ Operation Saves Scores Of Children From Abuse; Ringleader Shannon McCoole Behind Bars After Police Take Over Child Porn Site.”  For ten months, Argos, the Queensland police anti-pedophile taskforce, tracked usage on an Internet bulletin board whose 45,000 members viewed and uploaded child pornography.

The Dark Web is notorious for encrypting user information, and that is one of its main draws: users can conduct business or other illegal activities, such as viewing child pornography, without fear of retribution.  Even the Dark Web, however, leaves a digital trail, and Argos was able to track down the Web site’s administrator.  The administrator turned out to be an Australian childcare worker, who has since been sentenced to 35 years in jail for sexually abusing seven children in his care and sharing child pornography.

Argos caught the perpetrator by noticing patterns in the language of his posts to the bulletin board (he used the greeting “hiya”). Using advanced search techniques, the police sifted through results and narrowed them down to a Facebook page and a photograph.  From the Facebook page, they got the administrator’s name and made an arrest.
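In principle, that kind of content matching can start out very simple: score texts by distinctive marker phrases such as that greeting. The sketch below is our toy illustration of the idea, not the taskforce’s actual tooling; the posts and marker list are invented.

import re

markers = ["hiya"]  # distinctive phrases observed in a suspect's writing

def marker_score(text):
    # Count case-insensitive whole-word occurrences of any marker phrase.
    return sum(len(re.findall(r"\b" + re.escape(m) + r"\b", text, re.IGNORECASE))
               for m in markers)

posts = {
    "post_1": "Hiya all, new photos up.",
    "post_2": "Hello everyone, see attachment.",
}
for pid, text in posts.items():
    print(pid, marker_score(text))  # post_1 scores 1, post_2 scores 0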

After arresting the ringleader, Argos took over the community and started to track down the rest of the users.

” ‘Phase two was to take over the network, assume control of the network, try to identify as many of the key administrators as we could and remove them,’ Detective Inspector Jon Rouse said.  ‘Ultimately, you had a child sex offender network that was being administered by police.’ ”

When they took over the network, the police were required to work in real time to interact with the users and gather information to make arrests.

Even though the Queensland police were able to end one Dark Web child pornography ring and save many children from abuse, there are still many Dark Web sites centered on child sex trafficking.

 

Whitney Grace, September 4, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Chinese Opinion Monitoring Software by Knowlesys

August 18, 2015

Ever wonder what tools the Chinese government uses to keep track of those pesky opinions voiced by its citizens? If so, take a look at “IPOMS : Chinese Internet Public Opinion Monitoring System” at Revolution News. The brief write-up tells us about a software company, Knowlesys, reportedly supplying such software to China (among other clients). Reporter and Revolution News founder Jennifer Baker tells us:

“Knowlesys’ system can collect web pages with some certain key words from Internet news, topics on forum and BBS, and then cluster these web pages according to different ‘event’ groups. Furthermore, this system provides the function of automatically tracking the progress of one event. With this system, supervisors can know what is exactly happening and what has happened from different views, which can improve their work efficiency a lot. Most of time, the supervisor is the government, the evil government. sometimes a company uses the system to collect information for its products. IPOMS is composed of web crawler, html parser and topic detection and tracking tool.”

The piece includes a diagram that lays out the software’s process, from extraction to analysis to presentation (though the specifics are pretty standard to anyone familiar with data analysis in general). Data monitoring and mining firm Knowlesys was founded in 2003. The company has offices in Hong Kong and a development center in Shenzhen, China.
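The quote describes IPOMS as a crawler, an HTML parser, and a topic detection and tracking tool. A toy sketch of the first two stages, using only Python’s standard library, might look like the following; the URL and keyword list are illustrative, and a real system would add the “event” clustering the quote mentions.

import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Collect the text content of a page, discarding the markup.
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def page_matches(url, keywords):
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks).lower()
    return [kw for kw in keywords if kw.lower() in text]

print(page_matches("https://example.com", ["protest", "strike"]))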

Cynthia Murrell, August 18, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
