Can Machine Learning Pick Out The Bullies?
November 13, 2019
In Walt Disney’s 1942 classic Bambi, Thumper the rabbit was told, “If you can’t say something nice, don’t say nothing at all.”
Poor grammar aside, the thumping rabbit did deliver wise advice to the audience. Then came the Internet and anonymity, and the trolls were released on the world. Internet bullying is one of the world’s top cyber crimes, along with identity and money theft. Passionate anti-bullying campaigners, particularly individuals who were cyber-bullying victims, want social media Web sites to police their users and prevent the abusive crime. Trying to police the Internet is like herding cats. It might be possible with the right type of fish, but cats are not herd animals and scatter once the tasty fish is gone.
Technology might have advanced enough to detect bullying, and AI could be the answer. Innovation Toronto wrote, “Machine Learning Algorithms Can Successfully Identify Bullies And Aggressors On Twitter With 90 Percent Accuracy.” AI’s biggest problem is that although algorithms can identify and harvest information, they lack the ability to understand emotion and context. Many bullying actions on the Internet are sarcastic or hidden within metaphors.
Computer scientist Jeremy Blackburn and his team from Binghamton University analyzed bullying behavior patterns on Twitter. They discovered useful information to understand the trolls:
“ ‘We built crawlers — programs that collect data from Twitter via variety of mechanisms,’ said Blackburn. ‘We gathered tweets of Twitter users, their profiles, as well as (social) network-related things, like who they follow and who follows them.’ ”
The researchers then performed natural language processing and sentiment analysis on the tweets themselves, as well as a variety of social network analyses on the connections between users. The researchers developed algorithms to automatically classify two specific types of offensive online behavior, i.e., cyber bullying and cyber aggression. The algorithms were able to identify abusive users on Twitter with 90 percent accuracy. These are users who engage in harassing behavior, e.g. those who send death threats or make racist remarks to users.
“‘In a nutshell, the algorithms ‘learn’ how to tell the difference between bullies and typical users by weighing certain features as they are shown more examples,’ said Blackburn.”
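To make the “weighing certain features” idea concrete, here is a minimal sketch of a classifier that adjusts feature weights as it sees more labeled examples. It is not the Binghamton team’s method. The three features, the toy numbers, and the function names are all assumptions made up for illustration, and the technique shown is plain logistic regression trained by gradient descent.

```python
import math

# Hypothetical per-user features (NOT from the study), each scaled 0..1:
# [share of tweets with insults, average negative sentiment, share of replies to strangers]
# Label 1 = aggressive user, 0 = typical user. The numbers are invented toy data.
TRAINING = [
    ([0.9, 0.8, 0.7], 1),
    ([0.8, 0.9, 0.6], 1),
    ([0.7, 0.6, 0.9], 1),
    ([0.1, 0.2, 0.1], 0),
    ([0.2, 0.1, 0.3], 0),
    ([0.0, 0.3, 0.2], 0),
]


def sigmoid(z):
    """Squash a raw score into a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Return the estimated probability that a user is aggressive."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


def train(examples, epochs=500, rate=0.5):
    """Nudge each feature weight a little after every labeled example."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = predict(weights, bias, features) - label
            weights = [w - rate * error * x for w, x in zip(weights, features)]
            bias -= rate * error
    return weights, bias


weights, bias = train(TRAINING)
```

Calling `predict(weights, bias, [0.85, 0.85, 0.8])` gives a probability above 0.5 for a user who resembles the aggressive examples, while `[0.1, 0.1, 0.1]` lands below 0.5. A real system would need far more data and, as noted above, would still struggle with sarcasm and context.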
Blackburn and his team’s algorithm only detects aggressive behavior; it does nothing to prevent cyber bullying. The victims still see and are harmed by the comments and the bullying users, but the algorithm does give Twitter a heads up on removing the trolls.
The anti-bullying algorithm prevents bullying only after there are victims. It does little to assist the victims, but it does prevent future attacks. What steps need to be taken to prevent bullying altogether? Maybe schools need to teach classes on Internet etiquette with the Common Core. Then again, if it is not on the test, it will not be in a classroom.
Whitney Grace, November 13, 2019
Medical Data: A Google Focus for More Than a Decade
November 12, 2019
Medical data. Google has a bit of history. In 2008, Google made a play for personal health records. Don’t remember? Here’s what the interface looked like:
In 2011, this bold play went away. Doesn’t that sound familiar? A discontinued Google service.
Then Google bought DeepMind, the black hole of investment in the UK. DarkCyber noted this story: “Revealed: Google AI Has Access to Huge Haul of NHS Patient Data.” The write up stated:
A data-sharing agreement obtained by New Scientist shows that Google DeepMind’s collaboration with the NHS goes far beyond what it has publicly announced.
There was a dust up, but The Register reported: “Five NHS Trusts Do DeepMind Data Deal with Google. One Says No.”
DarkCyber noted the flurry of reports about Google’s tie up with Ascension, the second largest health care outfit in the US. You can read the paywalled Wall Street Journal story or you can look at one of the dozens of posts recycling this deal.
A few comments, perhaps? Why not?
First, Google has been beavering away at personal health data, including the famous CDC flu report, for more than a decade. Why? That’s a good question.
Second, Google needs new revenue. I know it sounds crazy, but the ad biz is not the same old money machine it was because the cost of “being Google” is rising more rapidly than Google’s old money machine can handle. That’s why YouTube will cut costs by trimming un-commercial videos. Plus, there are other problems; for example, Google’s famous management style. Health data may open some revenue opportunities? Yep, a handful.
Third, Google’s information is asymmetric. There is a lot of data from Web sites, books, and other open sources. But Google is a laggard when it comes to juicy, useful, easily exploitable fine grained personal data in the hands of Amazon and Facebook. Health data is a useful goodie. Health data is proprietary and quite person centric.
What can Google do with health data? Many things. But those applications are secondary in this blog post. The point today, gentle reader, is that Google is not doing anything new. Health data has been a focal point for a relatively long time.
Oh, would you buy Google insurance? No. Would your would-be employer buy information revealing a person was addicted to something? No. You might want to think about your answer. What about personalized ads to the parents of a child with an “issue”? No. Okay. No.
Stephen E Arnold, November 12, 2019
Google: Chronicle Is Not a Sci Fi Disaster Film. It Just Seems Like It
November 12, 2019
“Google’s Cybersecurity Project ‘Chronicle’ Imploding” may not be true. If the information in the Economic Times is accurate, Google has created another business school case study about Silicon Valley management methods, what DarkCyber describes with this acronym: HSSCMM (high school science club management methods).
In 2018, Chronicle was created by Alphabet, the rejiggered “owner” of Google, to be what the write up called “an independent start up.”
Yeah, that sounds good.
The goal of Chronicle was modest: “Revolutionize cybersecurity.”
Yeah, that sounds even better.
Engadget reported in June 2019:
The cybersecurity company launched in January 2018, and it released its first commercial product, Backstory, in March. In a blog post, Chronicle CEO and co-founder Stephen Gillett said Google Cloud’s cybersecurity tools and Chronicle’s Backstory and VirusTotal are complementary and will be leveraged together.
The Economic Times’ write up states:
Google’s cybersecurity project named “Chronicle” is imploding in trouble and some employees feel its management “abandoned and betrayed” the original vision, media reports said.
Staff, including the CEO, have looked for green pastures elsewhere. Chronicle was moved back to the Google mother ship. Salaries were a sore point. It seems Chronicle employees were paid less than other “real” Googlers.
Let’s assume that the information is maybe, sort of accurate. In this non sci-fi thought space, here are some observations:
- Thinking, assembling, announcing, and doing can be enhanced with management. No management, problems. Google seems beset with some non-linear challenges.
- The life span of this Google activity seems brief: January 2018 to November 2019. Is the time between launch and problems becoming more abbreviated?
- Google’s moon shot factory may be veering more and more into a boundary world: Big ideas fail due to the humans working on creating a reality.
To sum up: Chronicle may be another marker on the management superhighway. On the other hand, the Chronicle issue is real.
We’re back to Jorge Luis Borges, the Argentinean writer, who observed:
Reality is not always probable, or likely.
My high school science club was unreal but real as well. Click here for the theme song to Chronicle. Sorry, I meant Twilight Zone.
Stephen E Arnold, November 11, 2019
Curious about Semantic Search the SEO Way?
November 12, 2019
DarkCyber is frequently curious about search: Semantic, enterprise, meta, multi-lingual, Boolean, and the laundry list of buzzwords marshaled to allow a person to find an answer.
If you want to get a Zithromax Z-PAK of semantic search talk, navigate to “Semantic Search Guide.” One has to look closely at the URL to discern that this “objective” write up is about search engine optimization or SEO. DarkCyber affectionately describes SEO as the “relevance” killer, but that’s just our old-fashioned self refusing to adapt to the whizzy new world.
The link will point to a page with a number of links. These include:
- Target audience and contributions
- The knowledge graph explained
- The evolution of search
- Using Google’s entity search tool
- Getting a Wikipedia listing
DarkCyber took a look at the “Evolution of Search” segment. We found it quirky but interesting. For example, we noted this passage:
Now we turn to the heart of full-text search. SEOs tend to dwell on the indexing part of search or the retrieval part of the search, called the Search Engine Results Pages (SERPs, for short). I believe they do this because they can see these parts of the search. They can tell if their pages have been crawled, or if they appear. What they tend to ignore is the black box in the middle. The part where a search engine takes all those gazillion words and puts them in an index in a way that allows for instant retrieval. At the same time, they are able to blend text results with videos, images and other types of data in a process known as “Universal Search”. This is the heart of the matter and whilst this book will not attempt to cover all of this complex subject, we will go into a number of the algorithms that search engines use. I hope these explanations of sometimes complex, but mostly iterative algorithms appeal to the marketer inside you and do not challenge your maths skills too much. If you would like to take these ideas in in video form, I highly recommend a video by Peter Norvig from Google in 2011: https://www.youtube.com/watch?v=yvDCzhbjYWs
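The “black box in the middle” that the passage mentions is usually explained with an inverted index: a table mapping each word to the documents that contain it, so a query becomes a lookup rather than a scan. The sketch below is a toy illustration under that assumption. The sample documents and the function names are made up, and it says nothing about how Google’s indexes actually work.

```python
from collections import defaultdict

PUNCTUATION = ".,!?\"'"


def tokenize(text):
    """Lowercase the text and strip simple punctuation from each word."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def build_index(docs):
    """Map every word to the set of document ids in which it appears."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Made-up sample documents, for illustration only.
DOCS = {
    1: "Enterprise search is hard.",
    2: "Semantic search and the knowledge graph.",
    3: "The knowledge graph explained.",
}
INDEX = build_index(DOCS)
```

With this toy index, `search(INDEX, "knowledge graph")` returns documents 2 and 3 without reading any document text at query time. Note what is missing: there is no ranking, no notion of dates and times, and no blending across separate indexes.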
Oh, well. This is one way to look at universal search. But Google has silos of indexes. The system after 20 plus years does not federate results across indexes. Semantic search? Yeah, right. Search each index, scan results, cut and paste, and then try to figure out the dates and times. Semantic search does not do time particularly well.
Important. Not to the SEO. Search babble may be more compelling.
If this approach is your cup of tea, inLinks has the hot water you need to understand why finding information is not what it seems.
Stephen E Arnold, November 12, 2019
Find PDF Books. Exercise Caution, However
November 12, 2019
DarkCyber spotted a list of services which purport to find PDF books. The information appeared in “7 PDF Search Engines To Search And Download Free PDF Books.” For the complete list, navigate to the original story. Be aware, there are some annoying popups to distract one from the content of the article. We noted these three, but we are not prepared to offer a value judgment about the comprehensiveness of the index or the ultimate availability of the PDF document:
- SoPDF. This service has an alleged 43,000,000 free PDF books. Here’s the link.
- FreeFullPDF. This is a collection of sci-tech-med content, allegedly 80,000,000. Here’s the link.
- FileSearch Box PDF Search. This is a Google custom search engine. In theory, one can find these documents via the Google search box, but your mileage may vary. BERT and Ernie now have jobs at Google. Here’s the link.
Oh, some of the items in the original article may be malicious. Exercise judgment in your quest for free information.
Stephen E Arnold, November 12, 2019
The Key to Millions: Enterprise Search?
November 11, 2019
I thought the world was crazier than ever when enterprise search became the focal point of a multi-billion dollar deal and a multi-year lawsuit. The open source search movement picked up steam as companies shifted their attention from proprietary search and retrieval solutions to those maintained by a “community.” Search became a utility which many information technology professionals found a Bermuda Triangle for careers.
Why?
Our research prior to the publication of the three volumes of the Enterprise Search Report I wrote and our subsequent work on next generation search solutions revealed these problems:
- Enterprise search implies one size fits all. Information retrieval needs vary by business unit, department, and individuals. When one pokes around a large organization, one finds numerous search and information access systems. One size? Nope.
- Users look for information in the enterprise search system and cannot locate it. The reasons vary, but the universal gripe is, “I can’t locate the document I just saved.” The notion of real time is not one that fits into most organizations’ information infrastructure. Cost is one big reason. What looks good in a demo does not work in the “real world” of a company.
- Silos. The implications of “enterprise” suggest that a significant amount of information will be available to a user of the search system. Nothing could be further from the reality. Legal keeps some documents under lock and key. Personnel? The same approach. Research? No data goes out of the lab or the researcher’s workstation. On and on.
- Changes that are not captured. The top sales professional changes his presentation right before giving a talk to seal a big deal. The changes are not indexed because the sales professional has to do the contract. Missing info? Yes.
- Untracked digital information. Enterprise search has been neither quick nor adept at handling social media posts (authorized or unauthorized), interviews, videos produced in lieu of a written report, and similar information objects. Try to find key facts from these content collections. Give up yet?
I could extend this list, but I don’t have the energy. Few are interested in what caused Entopia to go out of business. No one I have spoken with in the last five years cares about why Fast Search & Transfer self destructed. No one cares.
I read “Want to Earn Millions? Launch an AI Based Enterprise Search Startup.” That’s a path to fame and riches. The write up states:
Enterprise search engines based on artificial intelligence systems are taking off fast. Cognitive search systems using NLP can include structured data contained in databases and even nontraditional enterprise information like pictures, video, sound, and machine information, for example, from the internet of things (IoT) gadgets, to bring contextual results in the actual business context.
Sounds good. How about this?
For startups and venture investing, the trend is clear. One prime example of this trend is the world’s leading space agency- NASA has enormous data ever since it was created in 1958. Now, the agency is working to make its data increasingly accessible for rocket designers and researchers. It is redesigning search and analytics abilities utilizing AI and natural language processing (NLP) systems created by a company known as Sinequa which is collaborating with the agency to deploy a worldwide knowledge management ability.
Amazing. NASA, which helped move forward technologies like RECON because engineers could not locate key documents, is looking at technology which has wobbled from search to intelligence and back again.
A quick reality check, gentle reader, please.
One can download open source search and retrieval software and get decent results. But there are firms which have goosed the “money” in enterprise search to astronomical levels:
- Algolia, $100 million
- Coveo, $200 million
- LucidWorks, $150 million
- ThoughtSpot, $248 million.
Now let’s think about Autonomy. At its height, the company reported revenues of about $800 million. HP paid $10.3 billion. After a short period of time, HP realized its massive sales and marketing system could not generate enough new sales and sustainable revenue to keep the Autonomy business an alleged winner.
How will these companies pitching enterprise search generate sufficient revenue to pay back their investors, fund research and development, add filters and other components needed to deal with today’s content flows, and support their existing systems as licensees try to make search work like investigative software?
The answer is, “The odds are quite unappealing.”
- Enterprise search has been available for half a century with some of the old school systems still available from OpenText in the guise of BRS Search
- Dissatisfaction with enterprise search systems generally runs about 50 to 70 percent in most organizations with such a system
- Costs of keeping an enterprise search and retrieval system continue to creep up despite the advent of managed services like those available from Amazon and others
Where are the customers?
That’s the question the article ignores.
Customers are likely to be just as tough to convince to use an enterprise solution as they have been for decades.
Net net: Enterprise search may not be the spring chicken the write up describes. Enterprise search has a history. And history is about to repeat itself. When the Autonomy matter is resolved, there may be a new search drama to follow.
Keep in mind that Google couldn’t make enterprise search work. But these cash stuffed outfits can? Maybe? Well, probably not.
Stephen E Arnold, November 11, 2019
Tech Backlash: Not Even Apple and Goldman Sachs Exempt
November 11, 2019
Times are indeed interesting. Two powerful outfits, Apple (the privacy outfit with a thing for Chinese food) and Goldman Sachs (the we-make-money-every-way-possible organization), are the subject of “Viral Tweet about Apple Card Leads to Goldman Sachs Probe.” The would-be president’s news machine stated, “Tech entrepreneur alleged inherent bias in algorithms for card.” The card, of course, is the Apple-Goldman revenue-generating credit card. Navigate to the Bloomberg story. Get the scoop.
On the other hand, just look at one of the dozens and dozens of bloggers commenting about this bias, algorithm, big name story. Even more intriguing is that the aggrieved tweeter’s wife had her credit score magically changed. Remarkable how smart algorithms work.
DarkCyber does not want to retread truck tires. We do have three observations:
- The algorithm part may be more important than the bias angle. The reason is that algorithms embody bias, and now non-technical and non-financial people are going to start asking questions: Superficial at first and then increasingly on point. Not good for algorithms when humans obviously can fiddle the outputs.
- Two usually untouchable companies are now in the spotlight for subjective, touchy feely things with which neither company is particularly associated. This may lead to some interesting information about what’s up in the clubby world of the richest companies in the world. Discrimination maybe? Carelessness? Indifference? Greed? We have to wait and listen.
- Even those who may have worked at these firms and who now may be in positions of considerable influence may find themselves between a squash wall and sweaty guests who aren’t happy about an intentional obstruction. Those corporate halls which are often tomb-quiet may resound with stressed voices. “Apple” carts which allegedly sell to anyone may be upset. Cleaning up after the spill may drag the doubles partners from two exclusive companies into a task similar to cleaning sea birds after the gulf oil spill.
Will this issue get news traction? Will it become a lawyer powered railroad handcar creeping down the line?
Fascinating stuff.
Stephen E Arnold, November 11, 2019
Quantum Cryptography: Rain on the Parade
November 11, 2019
I know a person (not too well, which may be a good thing) who is trying to cash in on the quantum gold rush. The angle for this entrepreneur is that quantum computing will allow government entities to break encryption.
The hitch in the git along is that there are bad actors who are involved in quantum computing. There are good actors who are creating quantum-safe cryptography with quantum computers.
Confused? Don’t be. People who need to encrypt gravitate to the high horsepower computers. The people who want to break encryption do what’s necessary to get access to quantum computers. The method used by Saudi Arabia to obtain specific social media data worked like a champ.
That brings me to “Komodo to Lead Blockchain Revolution with Quantum-Safe Cryptography.” The write up says:
As a blockchain platform, Stadelmann said that Komodo is trying to solve the problem and has implemented quantum-safe cryptographic solutions for the past couple of years which will not be able to crack cryptographic signatures. Using an IBM-built technology, known as Dilithium, into its blockchain platform, he said the new digital signature algorithm will create a key which cannot be cracked by a quantum computer.
Sounds good. Just another cat and mouse game. The people working to cash in on this scare tactic may find that organizations face the status quo, not doomsday. Confused? Just the status quo perhaps?
Stephen E Arnold, November 11, 2019
Buzzword Originator: Bits from LinkedIn
November 11, 2019
DarkCyber spotted “What on Earth Is a Data Scientist? The Buzzword’s Inventor DJ Patil Spills All.” The write up contained an interesting factoid:
The term “data scientist,” virtually unheard of just a few years ago, now returns over 25,000 results on LinkedIn’s Jobs page—that’s a solid 2,000 more than the search results of the universally trendy “financial analyst” (at least to us New Yorkers).
How valid is the phrase? We noted this statement from DJ Patil, a former LinkedIn professional and adviser to President Obama:
“But because I was working at LinkedIn, I just tested all the job titles we could think of to see which one would get the most interest from job applicants. Turns out that everybody wanted to be a data scientist, so we’re like, OK, that is what we will call ourselves.”
The hot title? Just marketing it seems.
Stephen E Arnold, November 12, 2019
Google: Bert Search Is Here. Where Is Ernie Advertising?
November 10, 2019
Google wants to stay at the top of search, so the company is constantly developing new technology to keep its search algorithms ahead of the competition. Fast Company shares the latest on Google’s search technology in the article, “Google Just Got Better At Understanding Your Trickiest Searches.” Search queries power all of Google’s searches, and the problem for search algorithms is understanding which words in the query are the most important. Another issue is that the algorithms need to understand how the words relate to one another. The relationship between keywords and their intent is subtle, particularly with all the shades of meaning in the English language.
Google’s newest search algorithm endeavor is dubbed BERT, short for Bidirectional Encoder Representations from Transformers. What does that mean?
“We non-AI scientists don’t have to worry about what encoders, representations, and transformers are. But the gist of the idea is that BERT trains machine language algorithms by feeding them chunks of text that have some of the words removed. The algorithm’s challenge is to guess the missing words—which turns out to be a game that computers are good at playing, and an effective way to efficiently train an algorithm to understand text. From a comprehension standpoint, it helps “turn keyword-ese into language,” said Google search chief Ben Gomes.”
Apparently the more text fed into a search, the better BERT can interpret its meaning. Google search scientists tested BERT by feeding the algorithm an endless stream of text from the search engine results. The “bidirectional” in BERT’s name comes from how the algorithm interprets data. Traditional search algorithms read English search queries from left to right, while BERT’s bidirectional approach reads each word in light of the words on both sides of it.
The average user will not recognize that BERT has altered their search results, but it will be beneficial to them. BERT will not have the same far-reaching impact as universal search and the knowledge graph, but it does give Google a competitive advantage.
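The “guess the missing words” game can be sketched in a few lines. The snippet below only prepares a training example by hiding words; it does not train anything and is not Google’s code. The function name, the sample query, the fixed seed, and the 15 percent masking rate are assumptions chosen for illustration.

```python
import random

MASK = "[MASK]"  # placeholder token standing in for a hidden word


def make_masked_example(sentence, mask_rate=0.15, seed=0):
    """Hide a fraction of the words and remember what was hidden.

    Returns (masked_tokens, targets), where targets maps each hidden
    position to the original word a model would be asked to guess.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    tokens = sentence.split()
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, targets


def restore(masked, targets):
    """Put the hidden words back, as a perfect guesser would."""
    return [targets.get(i, tok) for i, tok in enumerate(masked)]


QUERY = "can you get medicine for someone at the pharmacy"
masked_tokens, hidden_words = make_masked_example(QUERY)
```

The model’s job is to recover `hidden_words` from `masked_tokens`. Because the hidden word can sit anywhere in the sentence, a good guess has to use the words on both sides of the gap, which is the “bidirectional” part.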
The Wall Street Journal did some Google related sleuthing. The focus is advertising. You can read the story and look at the very millennial diagram in “How Google Edged Out Rivals and Built the World’s Dominant Ad Machine: A Visual Guide.” You will have to pay to learn what the diagram shown below means. You will also have to do some homework to figure out how advertising and search / retrieval are connected. That’s important to some. But that diagram is remarkable. It uses Google colors too.
Whitney Grace, November 10, 2019