Abandoned Books: Yep, Analytics to the Rescue
January 6, 2020
DarkCyber noted “The Most ‘Abandoned’ Books on GoodReads.” The idea is that by using available data, a list of books people could not finish reading can be generated. Disclosure: I will try free or $1.99 books on my Kindle and bail out if the content does not make me quiver with excitement.
The research, which is presented in academic finery, reports that the author of Harry Potter’s adventures churned out a book few people could finish. The title? The Casual Vacancy by J.K. Rowling. I was unaware of the book, but I will wager that the author is happy enough with the advance and any royalty checks which clear the bank. Success is not completion; success is money, I assume.
I want to direct your attention, gentle reader, to the explanation of the methodology used to award this singular honor to J.K. Rowling, who is probably pleased as punch with the bank interaction referenced in the preceding paragraph.
Several points merit brief, very brief comment:
- Bayesian. A go-to method. Works reasonably well. Guessing has its benefits.
- Data sets. Not exactly comprehensive. Amazon? What about the Kindle customer data, including time to abandonment, page of abandonment, etc.? Library of Congress? Any data to share? Top 20 library systems in the US? Got some numbers; for example, number of copies in circulation?
- Communication. The write up is a good example of why some big time thinkers ignore the inputs of certain analysts.
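For readers wondering what "Bayesian" typically buys in a ranking like this, here is a minimal Python sketch of one common approach: shrinking each book's raw abandonment rate toward a prior, so that a title with a handful of readers does not top the list by luck. Whether the GoodReads analysis works exactly this way is an assumption. The function name, the prior rate, and the prior weight are hypothetical, and the numbers are made up.

```python
def shrunk_abandon_rate(abandoned, readers, prior_rate=0.10, prior_weight=50):
    """Posterior mean abandonment rate under a Beta prior.

    prior_rate and prior_weight are illustrative assumptions, not figures
    taken from the GoodReads write up.
    """
    alpha = prior_rate * prior_weight        # pseudo-count of abandonments
    beta = (1 - prior_rate) * prior_weight   # pseudo-count of completions
    return (abandoned + alpha) / (readers + alpha + beta)


# Made-up numbers: a tiny sample versus a large one.
obscure = shrunk_abandon_rate(abandoned=5, readers=5)       # raw rate 100%
popular = shrunk_abandon_rate(abandoned=500, readers=1000)  # raw rate 50%

print(f"obscure title: {obscure:.3f}")
print(f"popular title: {popular:.3f}")
```

With these invented inputs, the obscure title's perfect raw score is pulled well below the popular title's score, which is the "educated guessing" the method is known for.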
To sum up, perhaps The Casual Vacancy may make a great gift when offered by Hamilton Books? A coffee table book perhaps?
Stephen E Arnold, January 6, 2020
Spies, Intelligence, and Publisher Motives
December 31, 2019
We are getting close to a new decade. This morning DarkCyber’s newsfeed contained two stories. These were different from the Year in Review and the What’s Ahead write ups that clog the info pipes as a year twists in the wind.
Even more interesting is the fact that the stories come from sources usually associated with recycled news releases and topics about innovations in look alike mobile phones, the antics of the Silicon Valley wizards, and gadgets rivaling the Popeil Pocket Fisherman in usefulness.
The first story is about Microsoft cracking down on a nation state which appears to have a desire to compromise US interests. “Microsoft Takes Down 50 Domains Operated by North Korean Hackers” states that:
Microsoft takes control of 50 domains operated by Thallium (APT37), a North Korean cyber-espionage group.
The write up added:
The domains were used to send phishing emails and host phishing pages. Thallium hackers would lure victims on these sites, steal their credentials, and then gain access to internal networks, from where they’d escalate their attacks even further.
DarkCyber finds this interesting. Specialist firms in the US and Israel pay attention to certain types of online activity. Now the outfit that brings the wonky Windows 10 updates and the hugely complex Azure cloud construct is taking action, with the blessing of a court. Prudent is Microsoft.
The second write up is “‘Shattered’: Inside the Secret Battle to Save America’s Undercover Spies in the Digital Age.” The write up appears to be the original work of Yahoo, a unit of Verizon. The article explains a breach and notes:
Whether the U.S. intelligence agencies will be able to make these radical changes is unclear, but without a fundamental transformation, officials warn, the nation faces an unprecedented crisis in its ability to collect human intelligence. While some believe that a return to tried and true tradecraft will be sufficient to protect undercover officers, others fear the business of human spying is in mortal peril and that the crisis will ultimately force the U.S. intelligence community to rethink its entire enterprise.
Note that the Yahoo original news story runs about 6,000 words. Buy a hot chocolate, grab a bagel, and chill as you work through the compilation of government efforts to deal with security, bad actors, bureaucratic procedures, and assorted dangers, clear, unclear, present, and missing in action. On the other hand, you can wait for the podcast because the write up seems to have some pot boiler characteristics woven through the “news.”
Read the original stories.
DarkCyber formulated several observations. Here they are:
- Will 2020 be the year of intelligence, cyber crime, and government missteps related to security?
- Why are ZDNet and Yahoo (both outfits with a history of wobbling from news release to news release) getting into what seems to be popularization of topics once ignored? Clicks? Ad dollars? Awards for journalism?
- What will stories like these trigger? One idea is that bad actors may become sufficiently unhappy to respond. Will these responses be a letter to the editor? Maybe. Maybe not. Unintended consequences may await.
This new interest of ZDNet and Yahoo may be a story in itself. Perhaps there is useful information tucked into the Yahoo Groups which Verizon will be removing from public access in a couple of weeks. And what about that Microsoft activity?
Stephen E Arnold, December 31, 2019
The Intercept Says Happy Holidays to Thomson Reuters
December 23, 2019
I read “How Ice Uses Social Media to Surveil and Arrest Immigrants.”
DarkCyber’s reaction to this story was, “What did Thomson Reuters do to warrant this Happy New Year greeting?” The good folks at Thomson Reuters are neither the largest nor the only source of information for analysts, both commercial and governmental. The write up describes a routine method of cross correlating items of information. The write up mentions a number of other outfits selling data to organizations. Hello, this is the commercial database business. The sector includes hundreds of companies, not just those who had a mostly forgotten connection to Lord Thomson of Fleet.
Please, sir, may I have some rich, hearty soup, not thin gruel?
A few observations:
- What other firms provide commercial data services to government agencies? Hint: LexisNexis, Experian, other government agencies, and lots, lots more.
- When did this business begin? What were the first commercial firms operating in this business sector? Hint: History can be interesting if one goes back to the days of RECON and SDC.
- What are the sources of data available to entities which are not allies of the United States? Hint: Singapore’s information sector is booming for a reason.
But the big red herring in the write up is the failure to address the one important weakness in most of the existing data services. What do we get? Thin porridge like that fed Tiny Tim.
My point is that focusing on Thomson Reuters is a misrepresentation of how data can be cross correlated. What happens if a new service becomes available which provides a meta service? That’s a story.
If you want to obtain a copy of a report which describes one new service taking shape, send an email to darkcyber333 at yandex dot com. A government or company email address is required. Will there be exceptions? Nope.
No Happy New Years to Thomson Reuters from the Intercept and none from me for those wanting a document without the required email type.
I know, “Humbug.”
Stephen E Arnold, December 23, 2019
Some Free Math Books
December 21, 2019
Dana C. Ernst has assembled a list of free and open source math texts. The list is useful and contains a range of information. Books are grouped by category, unlike the Barnes & Noble and Amazon approaches of ignoring meaningful topic clusters. Want calculus? You get calculus, not a book priced at $750. And for the exercise physiologists, lawyers, and home economics enthusiasts, the list includes “Introductory Differential Equations Using Sage.” No, sage is not a spice. Too bad the link is dead, but the Mathematical Association of America will sell David Joyner and Marshall Hampton’s book for just $60. You can access the list at this link.
Stephen E Arnold, December 21, 2019
Google Management Method Called Interrogation by CNBC
November 21, 2019
DarkCyber, happily ensconced in rural Kentucky, does not know if the information in “Google Employees Protested the Interrogation of Two Colleagues by Company’s Investigations Team, Memo Says” is accurate.
But the headline alone is quite interesting. The news story states:
The memo said Berland’s [a Google employee objecting to certain Google projects] questioning lasted 2.5 hours and was conducted by Google’s global investigations team, which allegedly told the employees that they were “not decision-makers” but that they would relay the workers’ message “up the chain.”
The memo seems to have been written by Googlers unhappy with the interaction of some Google professionals and two employees who had voiced concerns about the company’s work for the US government.
Please, read the original CNBC story.
DarkCyber jotted down several observations while two of my team and I tried to figure out who was on first:
1. The meeting was described as an interrogation. That in itself is an interesting word. Maybe interrogation is the wrong word, but it is clear that the meeting was not the equivalent of what my mother called a “kaffeeklatsch.”
2. The meeting involved an investigations team. DarkCyber did not know that Google had such a team, but presumably CNBC is confident that the ever popular online advertising company does. Does the investigations team have a uniform or maybe a badge with the cheerful Google logo?
3. Two and a half hours. My goodness. That’s longer than many feature films. The length of time brings some images to the forefront of the DarkCyber team’s hive mind. Here’s one that one of the programmer analysts called up from his Apple iPhone. (The objectivity of the iPhone search function must be considered, if not investigated.)
A cheerful setting for an informal chat or not?
Net net: If the CNBC story is accurate, Google’s management methods are quite interesting. Not even the high school science club to which I belonged in 1958 considered interrogation of non science club members. Grilling a science club member was simply not on our club members’ radar.
How times have changed!
Stephen E Arnold, November 21, 2019
False News: Are Smart Bots the Answer?
November 7, 2019
To us, this comes as no surprise: Axios reports, “Machine Learning Can’t Flag False News, New Studies Show.” Writer Joe Uchill concisely summarizes some recent studies out of MIT that should quell any hope that machine learning will save us from fake news, at least any time soon. Though we have seen that AI can be great at generating readable articles from a few bits of info, mimicking human writers, and even detecting AI-generated stories, that does not mean it can tell the true from the false. These studies were performed by MIT doctoral student Tal Schuster and his team of researchers. Uchill writes:
“Many automated fact-checking systems are trained using a database of true statements called Fact Extraction and Verification (FEVER). In one study, Schuster and team showed that machine learning-taught fact-checking systems struggled to handle negative statements (‘Greg never said his car wasn’t blue’) even when they would know the positive statement was true (‘Greg says his car is blue’). The problem, say the researchers, is that the database is filled with human bias. The people who created FEVER tended to write their false entries as negative statements and their true statements as positive statements — so the computers learned to rate sentences with negative statements as false. That means the systems were solving a much easier problem than detecting fake news. ‘If you create for yourself an easy target, you can win at that target,’ said MIT professor Regina Barzilay. ‘But it still doesn’t bring you any closer to separating fake news from real news.’”
Indeed. Another of Schuster’s studies demonstrates that algorithms can usually detect text written by their kin. We’re reminded, however, that just because an article is machine written does not in itself mean it is false. In fact, he notes, text bots are now being used to adapt legit stories to different audiences or to generate articles from statistics. It looks like we will just have to keep verifying articles with multiple trusted sources before we believe them. Imagine that.
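The FEVER give-away described in the quoted passage can be sketched in a few lines of Python. This is a toy illustration, not the MIT team's code: the cue list, the function name, and the four example claims are all made up. It shows a "fact checker" that never looks at evidence and simply labels any claim containing a negation word as false.

```python
# Hypothetical cue list; a real system would learn such cues from the data.
NEGATION_CUES = {"not", "never", "no", "didn't", "wasn't", "isn't"}


def guess_label(claim):
    """Label a claim 'false' if it contains a negation cue, else 'true'."""
    words = claim.lower().replace(".", "").split()
    return "false" if any(word in NEGATION_CUES for word in words) else "true"


# Made-up claims written the way the article says FEVER's authors tended to:
# true entries phrased positively, false entries phrased negatively.
toy_claims = [
    ("Greg says his car is blue.", "true"),
    ("The river runs through the city.", "true"),
    ("Greg did not buy a car.", "false"),
    ("The city was never flooded.", "false"),
]

accuracy = sum(guess_label(c) == label for c, label in toy_claims) / len(toy_claims)
print("accuracy on biased toy data:", accuracy)

# The article's own example: negatively phrased, yet consistent with the
# true statement that Greg says his car is blue.
tricky = "Greg never said his car wasn't blue."
print("label for tricky claim:", guess_label(tricky))
```

The shortcut scores perfectly on the biased toy data and then calls the double negative false without checking anything. That is the easy target Barzilay describes.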
Cynthia Murrell, November 7, 2019
A New Private Company Directory Entering the Information Super Highway
November 1, 2019
DarkCyber spotted “Crunchbase Raises $30 Million to Go after Private Companies’ Data.” Business directories can be lucrative. Just track down an old school Dun & Bradstreet senior manager.
The approach taken by Crunchbase, which for a short period of time was a Verizon property, consists of several parts:
- Tracking information about private companies
- Inclusion of information that will make the directory like LinkedIn, the Microsoft job hunting and social networking site
- A modern-day service able to host corporate Web sites (maybe a 21st century Geocities?). The idea is to capture “partnership and careers pages.”
The write up describes Crunchbase as “one of the largest publicly accessible repositories of data about private companies.”
We learned:
Crunchbase partners with more than 4,000 data suppliers that provide it with valuable information on startup companies, such as annual revenue or burn rate.
Oracle provides a data marketplace and Amazon may be spinning up its streaming data marketplace. Will Crunchbase partner, compete, or sell to either of these companies?
Once in a while, DarkCyber looks up a company on Crunchbase. The experience is a “begging for dollars” journey. The useful information has been trimmed in order to get DarkCyber to sign up for hundreds of dollars to look up information about a private company easily findable elsewhere. Good sources are the Web sites of the outfits pumping cash into startups, tweets, and discussion groups.
Can the $30 million succeed where other directories have found themselves operated by trade associations or intelligence software equipped with a data base of open source information?
Worth watching. We know the investors have their eyes open, as will Cengage, possibly the proud producer of Ward’s Business Directory of US Private and Public Companies.
Stephen E Arnold, November 1, 2019
Gender Bias in Old Books. Rewrite Them?
October 9, 2019
Here is an interesting use of machine learning. Salon tells us “What Reading 3.5 Million Books Tells Us About Gender Stereotypes.” Researchers led by University of Copenhagen’s Dr. Isabelle Augenstein analyzed 11 billion English words in literature published between 1900 and 2008. Not surprisingly, the results show that adjectives about appearance were most often applied to women (“beautiful” and “sexy” top the list), while men were more likely to be described by character traits (“righteous,” “rational,” and “brave” were most frequent). Writer Nicole Karlis describes how the team approached the analysis:
“Using machine learning, the researchers extracted adjectives and verbs connected to gender-specific nouns, like ‘daughter.’ Then the researchers analyzed whether the words had a positive, negative or neutral point of view. The analysis determined that negative verbs associated with appearance are used five times more for women than men. Likewise, positive and neutral adjectives relating to one’s body appearance occur twice as often in descriptions of women. The adjectives used to describe men in literature are more frequently ones that describe behavior and personal qualities.
“Researchers noted that, despite the fact that many of the analyzed books were published decades ago, they still play an active role in fomenting gender discrimination, particularly when it comes to machine learning sorting in a professional setting. ‘The algorithms work to identify patterns, and whenever one is observed, it is perceived that something is “true.” If any of these patterns refer to biased language, the result will also be biased,’ Augenstein said. ‘The systems adopt, so to speak, the language that we people use, and thus, our gender stereotypes and prejudices.’ Augenstein explained this can be problematic if, for example, machine learning is used to sift through employee recommendations for a promotion.”
Karlis does list some caveats to the study—it does not factor in who wrote the passages, what genre they were pulled from, or how much gender bias permeated society at the time. The research does affirm previous results, like the 2011 study that found 57% of central characters in children’s books are male.
Dr. Augenstein hopes her team’s analysis will raise awareness about the impact of gendered language and stereotypes on machine learning. If they choose, developers can train their algorithms on less biased materials or program them to either ignore or correct for biased language.
Cynthia Murrell, October 9, 2019
Thomson Reuters: Getting with the Conference Crowd
October 6, 2019
DarkCyber noted “Thomson Reuters acquires FC Business Intelligence.” FCBI, according to the firm’s Web site:
Founded round a kitchen table in 1990, originally with a focus on emerging markets, the company has grown organically in size and influence ever since.
We learned:
The business will be rebranded Reuters Events and will be operated as part of the Reuters News division of Thomson Reuters.
Thomson Reuters has not delivered hockey stick growth in the last three, five, eight years, has it?
Will conferences be the goose which puts golden eggs in the Thomson Reuters’ hen house?
What’s the motive force for a professional publishing outfit to get into conferences? DarkCyber hypothesizes that:
- Getting more cash from traditional professional publishing markets is getting more difficult; for example, few law firms have clients willing to pay the commercial online fees from the “good old days”
- Conferences, despite advances in technology, continue to give the Wall Street Journal and other organizations opportunities to meet and greet, generate revenue from booth rentals, and hop on hot topics
- Making one’s own news is easier than paying to just report the news, particularly if the news comes from a high profile conference.
Will Thomson Reuters slice and dice the content outputs in as many ways as possible? Possibly.
Worth watching as Lord Thomson of Fleet probably is from his eye in the sky.
Stephen E Arnold, October 6, 2019
An AI Tool to Identify AI-Written Text
September 19, 2019
When distinguishing human writing from AI-generated text, the secret is in the predictability. MIT Technology Review reports, “A New Tool Uses AI to Spot Text Written by AI.” We have seen how AI can produce articles that seem to us humans as if they were written by one of us, opening a new dimension in the scourge of fake news. Now, researchers have produced a tool that uses AI technology to detect AI-generated text. Writer Will Knight tells us:
“Researchers from Harvard University and the MIT-IBM Watson AI Lab have developed a new tool for spotting text that has been generated using AI. Called the Giant Language Model Test Room (GLTR), it exploits the fact that AI text generators rely on statistical patterns in text, as opposed to the actual meaning of words and sentences. In other words, the tool can tell if the words you’re reading seem too predictable to have been written by a human hand. … GLTR highlights words that are statistically likely to appear after the preceding word in the text. As shown in the passage above (from Infinite Jest), the most predictable words are green; less predictable are yellow and red; and least predictable are purple. When tested on snippets of text written by OpenAI’s algorithm, it finds a lot of predictability. Genuine news articles and scientific abstracts contain more surprises.”
See the article for that colorfully highlighted sample. Researchers enlisted Harvard students to test GLTR’s results. Without the tool, students spotted just half the AI-crafted passages. Using the highlighted results, though, they identified 72% of them. Such collaboration between the tool and human interpreters is the key to warding off fake articles, one researcher states. The article concludes with a link to try out the tool for oneself.
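The ranking-and-coloring idea in the quoted passage can be sketched in Python. To keep the example self-contained, a toy bigram counter stands in for the large language model GLTR actually relies on, and the rank cutoffs (10, 100, 1000) are assumed values, not taken from the article. All function names and the tiny corpus are hypothetical.

```python
from collections import Counter, defaultdict


def build_bigram_model(corpus):
    """Count which words follow each word in a (toy) training corpus."""
    model = defaultdict(Counter)
    words = corpus.lower().split()
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model


def rank_of(model, prev, word):
    """1-based rank of `word` among the model's guesses after `prev`, or None."""
    ranked = [w for w, _ in model.get(prev, Counter()).most_common()]
    return ranked.index(word) + 1 if word in ranked else None


def color_for(rank, cutoffs=(10, 100, 1000)):
    """Map a rank to a highlight color; the cutoffs are assumed values."""
    if rank is None:
        return "purple"  # the model never saw this continuation
    for cutoff, color in zip(cutoffs, ("green", "yellow", "red")):
        if rank <= cutoff:
            return color
    return "purple"


def highlight(model, text):
    """Return (word, color) pairs for every word after the first."""
    words = text.lower().split()
    return [(w, color_for(rank_of(model, p, w))) for p, w in zip(words, words[1:])]


toy_corpus = "the cat sat on the mat the cat ate the fish the cat sat"
model = build_bigram_model(toy_corpus)
for word, color in highlight(model, "the cat flew"):
    print(word, color)
```

A passage that comes back mostly green is the "too predictable" signal the researchers describe. Purple words are the surprises that human writers tend to supply.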
Cynthia Murrell, September 19, 2019