CyberOSINT: Next Generation Information Access Explains the Tech Behind the Facebook, GSR, Cambridge Analytica Matter
April 5, 2018
In 2015, I published CyberOSINT: Next Generation Information Access. This is a quick reminder that the profiles of the vendors who have created software systems and tools for law enforcement and intelligence professionals remain timely.
The 200 page book provides examples, screenshots, and explanations of the tools which are available to analyze social media information. The book is the most comprehensive rundown of the open source, commercial, and cloud based systems which can make sense of social media data, lawful intercept data, and general text and imagery content.
Companies described in this collection of “tools” include:
- Cyveillance (now LookingGlass)
- Decisive Analytics
- IBM i2 (Analysts Notebook)
- Geofeedia
- Leidos
- Palantir Gotham
- and more than a dozen other vendors of commercial and open source, high impact cyberOSINT tools.
The book is available for $49. Additional information is available on my Xenky.com Web site. You can buy the PDF book online at gum.co/cyberosint.
Get the CyberOSINT monograph. It’s the standard reference for practical and effective analysis, text analytics, and next generation solutions.
Stephen E Arnold, April 5, 2018
Insight into the Value of Big Data and Human Conversation
April 5, 2018
Big data and AI have been tackling tons of written material for years. But actual spoken human conversation has been largely overlooked in this world, mostly due to the difficulty of collecting this information. However, that is on the cusp of changing as we discovered from a white paper from the Business and Local Government Resource Center, “The SENSEI Project: Making Sense of Human Conversations.”
According to the paper:
“In the SENSEI project we plan to go beyond keyword search and sentence-based analysis of conversations. We adapt lightweight and large coverage linguistic models of semantic and discourse resources to learn a layered model of conversations. SENSEI addresses the issue of multi-dimensional textual, spoken and metadata descriptors in terms of semantic, para-semantic and discourse structures.”
While some people are excited about the potential for advancement this kind of big data research presents, others are a little more nervous; for example, one or two of the 87 million individuals whose Facebook data found its way into the capable hands of GSR and Cambridge Analytica.
In fact, there is a growing movement, according to the Guardian, to scale back big data intrusion. What makes this interesting is that advocates are demanding that companies which harvest our information for big data purposes give some of that money back to the people from whom the information originates, not unlike how songwriters are paid royalties every time their music is used in film or television. Putting a financial stipulation on big data collection could cause SENSEI to tap its brake pedal. Maybe?
Patrick Roland, April 5, 2018
Can Factmata Do What Other Text Analytics Firms Cannot?
April 2, 2018
Consider it a sign of the times—Information Management reveals, “Twitter, Craigslist Co-Founders Back Fact-Check Startup Factmata.” Writer Jeremy Kahn reports:
“Twitter Inc. co-founder Biz Stone and Craigslist Inc. co-founder Craig Newmark are investing in London-based fact-checking startup Factmata, the company said Thursday. … Factmata aims to use artificial intelligence to help social media companies, publishers and advertising networks weed out fake news, propaganda and clickbait. The company says its technology can also help detect online bullying and hate speech.”
Particularly amid concerns about the influence of Russian-backed propaganda in the U.S. and the U.K., several tech firms and other organizations have taken aim at false information online. What about Factmata has piqued the interest of leading investors? We’re informed:
“Dhruv Ghulati, Factmata’s chief executive officer, said the startup’s approach to fact-checking differs from other companies. While some companies are looking at a wide range of content, Factmata is initially focused exclusively on news. Many automated fact-checking approaches rely primarily on metadata – the information behind the scenes that describe online news items and other posts. But Factmata is using natural language processing to assess the actual words, including the logic being used, whether assertions are backed up by facts and whether those facts are attributed to reputable sources.”
Ghulati goes on to predict Facebook will be supplanted as users’ number one news source within the next decade. Apparently, we can look forward to the launch of Factmata’s own news service sometime “later this year.”
We will wait. We do want to point out that based on the information available to the Beyond Search and DarkCyber research teams, no vendor has been able to identify text which is weaponized at a high level of accuracy without the assistance of expensive, human, and vacation hungry subject matter experts.
Maybe Factmata will “mata”?
Cynthia Murrell, April 2, 2018
What Happens When Intelligence Centric Companies Serve the Commercial and Political Sectors?
March 18, 2018
Here’s a partial answer: [Images from the original post are not reproduced here.]
Years ago, certain types of companies with specific LE and intel capabilities maintained low profiles and, in general, focused on sales to government entities.
How times have changed!
In the DarkCyber video news program for March 27, 2018, I report on the Madison Avenue type marketing campaigns. These will create more opportunities for a Cambridge Analytica “activity.”
Net net: Sometimes discretion is useful.
Stephen E Arnold, March 18, 2018
Searching Video and Audio Files is Now Easier Than Ever
February 7, 2018
While text-based search has been honed to near perfection in recent years, video and audio search still lag. However, a few companies are really beginning to chip away at this problem. One that recently caught our attention was vidDistill, a company that distills YouTube videos into an indexed list.
According to their website:
vidDistill first gets the video and captions from YouTube based off of the URL the user enters. The caption text is annotated with the time in the video the text corresponds to. If manually provided captions are available, vidDistill uses those captions. If manually provided captions are not available, vidDistill tries to fall back on automatically generated captions. If no captioning of any sort is available, then vidDistill will not work.
Once vidDistill has the punctuated text, it uses a text summarization algorithm to identify the most important sentences of the entire transcript of the video. The text summarization algorithm compresses the text as much as the user specifies.
It was interesting and did what the company claimed. However, we wish users could search for words in the index and skip directly to the matching parts of a video. This has been done quite well in audio. A service called Happy Scribe, which is aimed at journalists transcribing audio notes, takes an audio file and (for a small fee) transcribes it to text, which can then be searched. It’s pretty elegant and fairly accurate, depending on the audio quality. We could see vidDistill taking this approach to great success.
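For readers who want to see the general idea, here is a minimal sketch in Python of the two steps discussed above: picking the most important caption sentences down to a user-chosen ratio, and mapping words to the times at which they are spoken. This is not vidDistill's code. The word-frequency scoring, the stopword list, the sample captions, and the names `summarize` and `build_word_index` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical timestamped captions: (start time in seconds, sentence).
CAPTIONS = [
    (0.0, "Welcome to this talk about search engines."),
    (12.5, "Search engines build an index of documents."),
    (30.0, "I had coffee this morning."),
    (41.0, "The index maps words to documents so search is fast."),
]

# A tiny, illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "to", "is", "so", "i", "this", "about", "had"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def summarize(captions, ratio):
    """Keep roughly `ratio` of the sentences, ranked by average word
    frequency, and return them in chronological order."""
    freq = Counter(
        w for _, sentence in captions for w in tokenize(sentence)
        if w not in STOPWORDS
    )

    def score(sentence):
        words = [w for w in tokenize(sentence) if w not in STOPWORDS]
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    keep = max(1, round(len(captions) * ratio))
    ranked = sorted(captions, key=lambda c: score(c[1]), reverse=True)[:keep]
    return sorted(ranked)  # tuples sort by start time first


def build_word_index(captions):
    """Map each word to the start times of the captions that contain it."""
    index = {}
    for start, sentence in captions:
        for word in set(tokenize(sentence)):
            index.setdefault(word, []).append(start)
    return index


if __name__ == "__main__":
    for start, sentence in summarize(CAPTIONS, 0.5):
        print(f"{start:6.1f}s  {sentence}")
    print(build_word_index(CAPTIONS)["index"])
```

With a word index like this, a viewer who looks up a word gets back a list of start times and can jump straight to those points in the video. A production system would need better sentence scoring and real caption data.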
Patrick Roland, February 7, 2018
DarkCyber for January 30, 2018, Now Available
January 30, 2018
DarkCyber for January 30, 2018, is now available at www.arnoldit.com/wordpress and on Vimeo at https://vimeo.com/253109084.
This week’s program looks at the 4iq discovery of more than one billion user names and passwords. The collection ups the ante for stolen data. The Dark Web database contains a search system and a “how to” manual for bad actors. 4iq, a cyber intelligence specialist, used its next-generation system to locate and analyze the database.
Stephen E Arnold said:
“The technology powering 4iq combines sophisticated data acquisition with intelligent analytics. What makes 4iq’s approach interesting is that the company integrates trained intelligence analysts in its next-generation approach. The discovery of the user credentials underscores the importance of 4iq’s method and the rapidly rising stakes in online access.”
DarkCyber discusses “reputation scores” for Dark Web contraband sites. The systems emulate the functionality of Amazon and eBay-style vendor report cards.
Researchers in Germany have demonstrated one way to compromise WhatsApp secure group chat sessions. With chat and alternative communication channels becoming more useful to bad actors than Dark Web forums and Web sites, law enforcement and intelligence professionals seek ways to gather evidence.
DarkCyber points to a series of Dark Web reviews. The sites can be difficult to locate using Dark Web search systems and postings on paste sites. One of the identified Dark Web sites makes use of a hosting service in Ukraine.
About DarkCyber
DarkCyber is one of the few video news programs which presents information about the Dark Web and lesser known Internet services. The information in the program comes from research conducted for the second edition of “Dark Web Notebook” and from the information published in Beyond Search, a free Web log focused on search and online services. The blog is now in its 10th year of publication, and the backfile consists of more than 15,000 stories.
Kenny Toth, January 30, 2018
Transcribing Podcasts with Help from Amazon
January 19, 2018
I enjoy walking the dog and listening to podcasts. However, I read more quickly than I listen. Speed up is a feature which works well for those in their mid 20s. At age 74, not so much.
Few podcasts create transcripts. Kudos to Steve Gibson at Security Now. He pays for this work himself because other podcasts on the TWiT network don’t offer much in the way of transcripts. And in the case of This Week in Law, there aren’t weekly programs. Recently, no programs. Helpful, no?
You can get the basics of the transcriptions produced by Amazon Transcribe in “Podcast Transcription with Amazon Transcribe.”
One has to be a programmer to use the service. Here’s the passage in the write up I highlighted:
The first thing that I would want out of this is speaker detection, i.e. knowing how many different speakers there are and to be able to differentiate their voices. Podcasts typically have more than one host, or a host and a guest for an interview, so that would be helpful. Also, it would be great to be able to send back corrections on words somehow, to help with the training. I’m sure Amazon has a pretty good thing going, but maybe on an account level? Or for proper nouns? I still think it would be good for people to provide that feedback.
Perhaps the podcast transcript void can be filled—at long last.
Stephen E Arnold, January 19, 2018
Google Search: More Digital Gutenberg Action
December 24, 2017
Years ago I wrote “Google: The Digital Gutenberg.” The point of the monograph was to call attention to the sheer volume of content which Google generates. Few people outside of my circle of clients who paid for the analysis took much notice.
I spotted this article in my stream of online content. “Google Search Updates Take a Personalized Turn” explains that a Google search for oneself – what some folks call an egosearch – returns a list of results with a bubblegum card about the person. (A bubblegum card is intel jargon for a short snapshot of a person of interest.)
The publishing angle – hence the connection to Gutenberg – is that the write up reports that a person who runs an egosearch can update the information about himself or herself.
A number of interesting angles sparkle from this gem of converting search into something more “personal.” What’s interesting is that the functionality reaches back to the illustration of a bubblegum card about Michael Jackson which appears in US20070198481. Here’s an annotated patent document snippet from one of my for-fee Google lectures which I was giving in the 2006 to 2009 time period:
Some information professionals will recognize this as an automated bubblegum card complete with aliases, personal details, last known location, etc. If you have money to spend, there are a number of observations my research team formulated about this “personalization” capability.
I liked this phrase in the Scalzi write up: “pretty deep into the Google ecosystem.” Nope, there is much more within the Google content parsing and fusion system. Lots, lots more for “Automatic Object Reference Identification and Linking in a Browseable Fact Repository.”
Keep in mind that this is just one output from the digital Gutenberg which sells ads, delivers online search free to you and me, and tries to solve death and other interesting genetic issues.
Stephen E Arnold, December 24, 2017
Progress: From Selling NLP to Providing NLP Services
December 11, 2017
Years ago, Progress Software owned an NLP system. I recall conversations with natural language processing wizards from EasyAsk. Larry Harris developed a natural language system in 1999 or 2000. Progress purchased EasyAsk in 2005 if memory serves. I interviewed Craig Bassin in 2010 as part of my Search Wizards Speak series.
The recollection I have is that Progress divested itself of EasyAsk in order to focus on enterprise applications other than NLP. No big deal. Software companies are bought and sold every day.
However, what makes this recollection interesting to me is the information in “Beyond NLP: 8 Challenges to Building a Chatbot.” Progress went from a software company which owned an NLP system to a company which is advising people like me how challenging a chatbot system can be to build and make work. (I noted that the Wikipedia entry for Progress does not mention the EasyAsk acquisition and subsequent de-acquisition. Either small potatoes or a milestone best jumped over, I assume.)
Presumably it is easier to advise and get paid to implement than to fund and refine an NLP system like EasyAsk. If you are not familiar with EasyAsk, the company positions itself in eCommerce site search with its “cognitive eCommerce” technology. EasyAsk’s capabilities include voice enabled natural language mobile search. This strikes me as a capability which is similar to that of a chatbot as I understand the concept.
History is history, one of my high school teachers once observed. Let’s move on.
What are the eight challenges to standing up a chatbot which sort of works? Here they are:
- The chat interface
- NLP
- The “context” of the bot
- Loops, splits, and recursions
- Integration with legacy systems
- Analytics
- Handoffs
- Character, tone, and persona.
As I review this list, I note that I have to decide whether to talk to a chatbot or type into a box so a “customer care representative” can assist me. The “representative,” one assumes, is a smart software robot.
I also notice that the bot has to have context. Think of a car dealer and the potential customer. The bot has to know that I want to buy a car. Seems obvious. But okay.
“Loops, splits, and recursions.” Frankly I have no idea what this means. I know that chatbot centric companies use jargon. I assume that this means “programming” so the NLP system returns a semi-on point answer.
Integration with legacy systems and handoffs seem to be similar to me. I would just call these two steps “integration” and be done with it.
The “character, tone, and persona” seems to apply to how the chatbot sounds; for example, the nasty, imperious tone of a Kroger automated check out system.
Net net: Progress is in the business of selling advisory and engineering services. The reason, in my opinion, is that Progress could not crack the code to make search and retrieval generate expected payoffs. As for some Convera executives, selling search related services was a more attractive path.
Stephen E Arnold, December 11, 2017
Google Search and Hot News: Sensitivity and Relevance
November 10, 2017
I read “Google Is Surfacing Texas Shooter Misinformation in Search Results — Thanks Also to Twitter.” What struck me about the article was the headline; specifically, the implication for me was that Google was not responding to user queries. Google is actively “surfacing” or fetching and displaying information about the event. Twitter is also involved. I don’t think of Twitter as much more than a party line. One can look up keywords or see a stream of content containing a keyword or, to use Twitter speak, a “hashtag.”
The write up explains:
Users of Google’s search engine who conduct internet searches for queries such as “who is Devin Patrick Kelley?” — or just do a simple search for his name — can be exposed to tweets claiming the shooter was a Muslim convert; or a member of Antifa; or a Democrat supporter…
I think I understand. A user inputs a term and Google’s system matches the user’s query to the content in the Google index. Google maintains many indexes, despite its assertion that it is a “universal search engine.” One has to search across different Google services and their indexes to build up a mosaic of what Google has indexed about a topic; for example, blogs, news, the general index, maps, finance, etc.
Developing a composite view of what Google has indexed takes time and patience. The results may vary depending on whether the user is logged in, searching from a particular geographic location, or has enabled or disabled certain behind the scenes functions for the Google system.
The write up contains this statement:
Safe to say, the algorithmic architecture that underpins so much of the content internet users are exposed to via tech giants’ mega platforms continues to enable lies to run far faster than truth online by favoring flaming nonsense (and/or flagrant calumny) over more robustly sourced information.
From my point of view, figuring out what influences Google’s search results requires significant effort, numerous test queries, and recognition that Google search now balances on two pogo sticks. One “pogo stick” is blunt force keyword search. When content is indexed, terms are plucked from source documents. The system may or may not assign additional index terms to the document; for example, geographic tags or time stamps.
The other “pogo stick” is discovery and assignment of metadata. I have explained some of the optional tags which Google may or may not include when processing a content object; for example, see the work of Dr. Alon Halevy and Dr. Ramanathan Guha.
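To make the two “pogo sticks” concrete, here is a minimal sketch in Python of a toy index that plucks terms from documents and optionally attaches metadata tags to them. It is an illustration of the general technique only, not a description of how Google’s system works. The class name `TinyIndex`, the tag names, and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index with optional per-document metadata tags."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.metadata = {}                # document id -> dict of tags

    def add(self, doc_id, text, **tags):
        """Pluck terms from the text and record any optional tags."""
        for term in tokenize(text):
            self.postings[term].add(doc_id)
        self.metadata[doc_id] = tags

    def search(self, query, **required_tags):
        """Return ids of documents that contain every query term and
        carry every required tag value."""
        terms = tokenize(query)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(
            doc_id for doc_id in hits
            if all(self.metadata[doc_id].get(k) == v
                   for k, v in required_tags.items())
        )


if __name__ == "__main__":
    idx = TinyIndex()
    idx.add("d1", "Flood warning issued for Louisville", place="Louisville")
    idx.add("d2", "Flood insurance rates rise")  # no tags assigned
    print(idx.search("flood"))
    print(idx.search("flood", place="Louisville"))
```

The keyword lookup alone returns both documents; adding a tag constraint narrows the result to the one document that was assigned that tag. The same query can therefore produce different results depending on which tags the system chose to assign.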
But Google, like other smart content processing systems today, has a certain sensitivity. This means that the system can react when the streams of content it processes contain certain keywords.
When “news” takes place, the flood of content allows smart indexing systems to identify a “hot topic.” The test queries we ran for my monographs “The Google Legacy” and “Google Version 2.0” suggest that Google is sensitive to certain “triggers” in content. Feedback can be useful; it can also cause smart software to wobble a bit.
T shirts are easy; search is hard.
I believe that the challenge Google faces is similar to the problem Bing and Yandex are exploring as well; that is, certain numerical recipes can over react to certain inputs. These over reactions may increase the difficulty of determining what content object is “correct,” “factual,” or “verifiable.”
Expecting a free search system, regardless of its owner, to know what’s true and what’s false is understandable. In my opinion, making this type of determination with today’s technology, system limitations, and content analysis methods is impossible.
In short, the burden of figuring out what’s right and what’s not correct falls on the user, not exclusively on the search engine. Users, on the other hand, may not want the “objective” reality. Search vendors want traffic and want to generate revenue. Algorithms want nothing.
Mix these three elements and one takes a step closer to understanding that search and retrieval is not the slam dunk some folks would have me believe. In fact, the sensitivity of content processing systems to comparatively small inputs requires more discussion. Perhaps that type of information will come out of discussions about how best to deal with fake news and related topics in the context of today’s information retrieval environment.
Free search? Think about that too.
Stephen E Arnold, November 10, 2017