April 20, 2016
Social media services attempt to eliminate pornographic content on their sites through a combination of user reporting and algorithms. Nevertheless, the Daily Star reports "Shock as One Million Explicit Porn Films Found on Instagram." This content existed on Instagram despite its no-nudity policy. According to the article, many of the pornographic videos and photos were removed after the news broke. Summarizing how the content was initially published, the article states,
“The videos were unearthed by tech blogger Jed Ismael, who says he’s discovered over one million porn films on the site. Speaking on his blog, Ismael said: “Instagram has banned certain English explicit hashtags from being showed in search. “Yet users seem to find a way around the policy, by using non English terms or hashtags. “I came across this discovery by searching for the hashtag “?????” which means movies in Arabic.” Daily Star Online has performed our own search and easily found hardcore footage without the need for age verification checks.”
While Tor has typically been seen as the home for such services, it appears some users have found a workaround. Who needs the Dark Web? As for those online translation systems, perhaps some services should consider their utility.
Megan Feil, April 20, 2016
April 19, 2016
Remember when user information was leaked from the extramarital affairs website AshleyMadison? While the leak caused many controversies, the release of this information specifically on the Dark Web gives reason to revisit an article from Mashable, "Another Blow for Ashley Madison: User Emails Leaked on Dark Web," as a refresher on the role Tor played. A 10-gigabyte file was posted as a torrent on the Dark Web which included emails and credit card information among other user data. The article concluded,
“With the data now out there, Internet users are downloading and sifting through it for anything – or, rather, anyone – of note. Lists of email addresses of AshleyMadison users are being circulated on social media. Several appear to be connected to members of the UK government but are likely fake. As Wired notes, the site doesn’t require email verification, meaning the emails could be fake or even hijacked.”
The future of data breaches and leaks may be unclear, but the falsification of information — leaked or otherwise — always remains a possibility. Regardless of the element of scandal existing in future leaks, it is important to note that hackers and other groups are likely not above manipulation of information.
Megan Feil, April 19, 2016
April 18, 2016
I read a story about matching up user queries with images. I don’t think Google’s image search is particularly good. Examples include Google’s obsession with taking a query like “truth” and returning images of pictures with the word “truth” in them, as well as this image:
What about the query “watson”? Google showed a picture of a computer, a person named Sherlock, and images of this guy:
The write up “Do Google’s ‘Unprofessional Hair’ Results Show It Is Racist?” wants to point out that Google’s methods have a nasty side. I noted this passage:
We’ve always conceived of search engines as arcane but neutral creatures, obedient only to our will and to the precious logic of information. Older engines from the advent of the internet reflected this: Remember “Ask Jeeves,” the genteel butler? Dogpile, which would “fetch” things for you? Despite this fantasy, the things engines and their algorithms are able to know and to find are influenced by the content we give them to work with, which means they may reflect our own biases.
AskJeeves was a human-powered system. The Google is algorithmic. Google does not “give” its image search system content. The image search system indexes what it finds, within the depth settings of the crawl. Sorry, gentle reader, Google does not index everything available via the Internet. Bummer, right?
I circled this statement:
is its [image search’s] purpose to reflect and reinforce what its users feel, do and believe? Or is it to show us a fuller picture of the world and all things contained in it as they really are? Google Images was conceived in response to what people most wanted to see. Maybe it hasn’t decided yet what we most need to see.
The Guardian itself is an interesting legal search. Run the query “guardian” on Google Images and what does one find? Here you go:
The logo of the “real” journalistic thing and the word “truth.” Now is that biased?
Stephen E Arnold, April 18, 2016
April 18, 2016
The article titled Mindbreeze and MEDIALIFE Launch Strategic Partnership on BusinessWire discusses what the partnership means for the Slovak and Czech enterprise search market. MediaLife emphasizes its concentrated approach to document management systems for Slovak customers in need of large systems for the management, processing, and storage of documents. The article details,
“Based on this partnership, we provide our customers innovative solutions for fast access to corporate data, filtering of relevant information, data extraction and their use in automated sorting (classification)… Powerful enterprise search systems for businesses must recognize relationships among different types of information and be able to link them accordingly. Mindbreeze InSpire Appliance is easy to use, has a high scalability and shows the user only the information which he or she is authorized to view.”
Daniel Fallmann, founder and CEO of Mindbreeze, complimented himself on his selection of a partner in MediaLife and licked his chops at the prospect of the new Eastern European client base opened to Mindbreeze through the partnership. Other Mindbreeze partners exist in Italy, the UK, Germany, Mexico, Canada, and the USA, as the company advances its mission to supply enterprise search appliances as well as big data and knowledge management technologies.
Chelsea Kerwin, April 18, 2016
April 18, 2016
What better way to train a natural language AI than to bring venerated human authors into the equation? Wired reports, “Google Wants to Predict the Next Sentences of Dead Authors.” Not surprisingly, Google researchers are tapping into Project Gutenberg for their source material. Writer Matt Burgess relates:
“The network is given millions of lines from a ‘jumble’ of authors and then works out the style of individual writers. Pairs of lines were given to the system, which made a simple ‘yes’ or ‘no’ decision to whether they matched up. Initially the system didn’t know the identity of any authors, but still only got things wrong 17 percent of the time. By giving the network an indication of who the authors were, giving it another factor to compare work against, the computer scientists reduced the error rate to 12.3 percent. This was also improved by adding a fixed number of previous sentences to give the network more context.”
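The pairwise yes/no matching described in the quote can be approximated, very roughly, with a similarity check over stylistic features. The sketch below is a toy illustration, not Google's neural method: it fingerprints each line with character trigrams and makes a yes/no call against a threshold. Both the feature choice and the threshold value are assumptions made purely for illustration.

```python
from collections import Counter
import math

def trigram_vector(line: str) -> Counter:
    """Character-trigram counts as a crude stylistic fingerprint."""
    s = line.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def same_author(line1: str, line2: str, threshold: float = 0.25) -> bool:
    """Toy 'yes/no' decision on whether two lines match stylistically."""
    return cosine(trigram_vector(line1), trigram_vector(line2)) >= threshold
```

Identical lines score 1.0 and lines sharing no trigrams score 0.0; the real system, of course, learns its decision boundary from millions of training pairs rather than using a hand-set threshold.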
The researchers carry their logic further. As the Wired title says, they have their AI predict an author’s next sentence; we’re eager to learn what Proust would have said next. They also have the software draw conclusions about authors’ personalities. For example, we’re told:
“Google admitted its predictions weren’t necessarily ‘particularly accurate,’ but said its AI had identified William Shakespeare as a private person and Mark Twain as an outgoing person. When asked ‘Who is your favourite author?’ and [given] the options ‘Mark Twain’, ‘William Shakespeare’, ‘myself’, and ‘nobody’, the Twain model responded with ‘Mark Twain’ and the Shakespeare model responded with ‘William Shakespeare’. Asked who would answer the phone, the AI Shakespeare hoped someone else would answer, while Twain would try and get there first.”
I can just see Twain jumping over Shakespeare to answer the phone. The article notes that Facebook is also using the work of human authors to teach its AI, though that company elected to use children’s classics like The Jungle Book, A Christmas Carol, and Alice in Wonderland. Will we eventually see a sequel to Through the Looking Glass?
Cynthia Murrell, April 18, 2016
April 15, 2016
The article on eWeek titled Microsoft Debuts Azure Basic Search Tier relates the perks of the new plan from Microsoft, namely, that it is cheaper than the others. At $75 per month (and currently half off for the preview period, so get it while it’s hot!) the Azure Basic plan has lower capacity when it comes to indexing, but that is the intention. The completely Free plan enables indexing of 10,000 documents and allows for 50 megabytes of storage, while the new Basic plan goes up to a million documents. The more expensive Standard plan costs $250/month and provides for up to 180 million documents and 300 gigabytes of storage. The article explains,
“The new Basic tier is Microsoft’s response to customer demand for a more modest alternative to the Standard plans, said Liam Cavanagh, principal program manager of Microsoft Azure Search, in a March 2 announcement. “Basic is great for cases where you need the production-class characteristics of Standard but have lower capacity requirements,” he stated. Those production-class capabilities include dedicated partitions and service workloads (replicas), along with resource isolation and service-level agreement (SLA) guarantees, which are not offered in the Free tier.”
So just how efficient is Azure Search? Cavanagh stated that his team measured indexing performance at 15,000 documents per minute (although he also stressed that this was with documents batched in groups of 1,000). With this new plan, Microsoft continues to build out its cloud search capabilities.
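For a back-of-the-envelope comparison of the tiers, the figures quoted in the article work out as follows. This is illustrative arithmetic only; the prices, document limits, and measured indexing rate are taken from the article and may since have changed.

```python
# Published figures from the article; treat these as point-in-time assumptions.
TIERS = {
    "Free":     {"monthly_usd": 0,   "max_docs": 10_000},
    "Basic":    {"monthly_usd": 75,  "max_docs": 1_000_000},
    "Standard": {"monthly_usd": 250, "max_docs": 180_000_000},
}

DOCS_PER_MINUTE = 15_000  # Cavanagh's measured rate, batched 1,000 per request

def cost_per_million_docs(tier: str) -> float:
    """Monthly price divided by capacity, normalized to one million documents."""
    t = TIERS[tier]
    return t["monthly_usd"] / (t["max_docs"] / 1_000_000)

def minutes_to_fill(tier: str) -> float:
    """Time to index a tier to capacity at the measured throughput."""
    return TIERS[tier]["max_docs"] / DOCS_PER_MINUTE
```

At the measured rate, filling Basic to its one-million-document cap would take a bit over an hour, while Standard's 180 million documents works out to roughly 8.3 days of continuous batched indexing.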
Chelsea Kerwin, April 15, 2016
April 13, 2016
The article titled The 10 Commandments of Business Intelligence in Big Data on Datanami offers wisdom written on USB sticks instead of stone tablets. In the Business Intelligence arena, apparently moral guidance can take a backseat to Big Data cost-savings. Suggestions include: Don’t move Big Data unless you must, try to leverage your existing security system, and engage in extensive data visualization sharing (think Github). The article explains the importance of avoiding certain price-gouging traps,
“When done right, [Big Data] can be extremely cost effective… That said…some BI applications charge users by the gigabyte… It’s totally common to have geometric, exponential, logarithmic growth in data and in adoption with big data. Our customers have seen deployments grow from tens of billions of entries to hundreds of billions in a matter of months. That’s another beauty of big data systems: Incremental scalability. Make sure you don’t get lowballed into a BI tool that penalizes your upside.”
The Fifth Commandment reminds us all that analyzing the data in its natural, messy form is far better than flattening it into tables due to the risk of losing key relationships. The Ninth and Tenth Commandments step back and look at the big picture of data analytics in 2016. What was only a buzzword to most people just five years ago is now a key aspect of strategy for any number of organizations. This article reminds us that thanks to data visualization, Big Data isn’t just for data scientists anymore. Employees across departments can make use of data to make decisions, but only if they are empowered to do so.
Chelsea Kerwin, April 13, 2016
April 11, 2016
Affecting organizations from Target to JP Morgan Chase, data breaches are increasingly common, and security firms are popping up to address the issue. The article “Dark Web Data Hunter Terbium Labs Secures $6.4m in Fresh Funding” from ZDNet reports Terbium Labs received $6.4 million in Series A funding. Terbium Labs released software called Matchlight which provides real-time surveillance of the Dark Web and alerts enterprises when their organization’s data surfaces. Consumer data, sensitive company records, and trade secrets are among the types of data for which enterprises are seeking protection. We learned,
“Earlier this month, cloud security firm Bitglass revealed the results of an experiment focused on how quickly stolen data spreads through the Dark Web. The company found that within days, financial credentials leaked to the underground spread to 30 countries across six continents with thousands of users accessing the information.”
While Terbium appears to offer value for stopping a breach once it has started, what about preventing such breaches in the first place? Perhaps there are opportunities for partnerships between Terbium and players in the prevention arena. Or, then again, maybe companies will buy piecemeal services from individual vendors.
Megan Feil, April 11, 2016
April 10, 2016
I read “Stupeflix’s Acquisition by Go Pro: The First Exalead Mafia Exit.” The write up stated:
Acquired in 2011 by Dassault Systemes, Exalead was a power-house for big data & search talent in the mid 2000’s – specifically, out of Exalead Labs, their internal ‘playground’ – and the former employees (most of the engineering team left in the years following the acquisition) have gone on to start great startups: Algolia, Dataiku, OpenDataSoft – even Disclose, our own product, is built by our CTO Guillaume Esquevin, an Exalead alumnus – and, of course, Stupeflix. Cofounders Nicolas Steegman & Francois Lagunas met during their time at Exalead.
The PayPal mafia includes Peter Thiel and a handful of other Silicon Valley luminaries. The French version of the innovation gang has generated a winner.
I noted this statement:
For the rest of the Exalead Mafia, I’ll be keeping an eye out. Another round for Dataiku may be in the works – the startup just moved into some luxurious offices overlooking the Rex theater in the heart of Paris’ startup neighborhood. Algolia’s post-YC growth has been incredible, releasing feature after feature and wooing clients. Most recently they launched Super Bowl Search site where you can search all the ads that have ever aired during the Super Bowl. Expect more great things from the Exalead Mafia.
One point: My records show that Dassault acquired Exalead in 2010 for about $160 million.
Stephen E Arnold, April 10, 2016
April 10, 2016
Short honk: I saw a Tweet about ResearchCue. According to the firm’s Web site, the service handles data aggregation for business intelligence. I checked out the report “Top Companies in Semantic Web.” A search box allows the site visitor to enter Boolean queries. The concept is that a person looking for information wants a report with snippets of relevant information automatically located and displayed in an easy-to-scan format. The presentation highlights important articles, some metrics such as the number of articles and tweets in a time period, and the list of companies in the Semantic Web sector. For vendors of keyword search solutions, this type of service is a reminder that lists of articles are not going to core the apple. Many search vendors talk about “search” and then deliver 1970s-style results. ResearchCue is making more widely available the type of information access tools I discussed in CyberOSINT: Next Generation Information Access. For traditional vendors of proprietary search systems, the future may have already passed many companies by.
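As a rough illustration of the kind of Boolean filtering a search box like this might support, here is a minimal evaluator. The grammar (flat left-to-right AND/OR/NOT, no parentheses) is a hypothetical simplification for illustration; ResearchCue's actual query syntax is not documented here.

```python
import re

def matches(query: str, text: str) -> bool:
    """Evaluate a flat AND/OR/NOT Boolean query against one document.

    Left-to-right evaluation, no parentheses -- a deliberately tiny
    illustration, not any vendor's real syntax.
    """
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    result, op, negate = None, "AND", False
    for tok in query.split():
        up = tok.upper()
        if up in ("AND", "OR"):
            op = up                      # remember the pending operator
        elif up == "NOT":
            negate = True                # negate the next term only
        else:
            val = tok.lower() in words   # does the document contain the term?
            if negate:
                val, negate = not val, False
            if result is None:
                result = val
            elif op == "AND":
                result = result and val
            else:
                result = result or val
    return bool(result)
```

A real engine would, of course, run such queries against an inverted index over millions of documents rather than scanning each one, but the yes/no semantics are the same.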
Stephen E Arnold, April 10, 2016