OSINT for Amateurs

January 13, 2022

Today I had a New Year chat with a person whom I met at specialized services conferences. I relayed to my friend the news that Robert David Steele, whom I had known since 1986, died in the autumn of 2021. Steele, a former US government professional, was described as one of the people who pushed open source intelligence down the bobsled run to broad use in government entities. Was he the “father of OSINT”? I don’t know. He and I talked via voice and email each week for more than 30 years. Our conversations explored the value of open source intelligence and how to obtain it.

After the call I read “How to Find Anyone on the Internet for Free.”

Wow, shallow. Steele would have had sharp words for the article.

The suggestions are just okay. Plus it is clear that a lack of awareness about OSINT exists.

My suggestion is that anyone writing about this subject spend some time learning about OSINT. There are books from professionals like Steele as well as my CyberOSINT: Next Generation Information Access. Also, attending a virtual conference about OSINT offered by those who have a background in intelligence would be useful. Finally, there are numerous resources available from intelligence gathering organizations. Some of these “lists” include a description of each site, service, or system mentioned.

For me and my team’s part, we are working to create 60 second videos which we will make available on Instagram-type services. Each short profile of an OSINT resource will appear under the banner “OSINT Radar.” These will be high value OSINT resources. Some of this information will also be presented in a new series of short articles and videos that Meg Coker, a former senior telecommunications executive, and I will create. Look for these in LinkedIn and other online channels.

Hopefully the information from OSINT Radar and the Coker-Arnold collaboration will provide data about OSINT resources which are useful and effective. Free and OSINT can go together, but the hard reality is that an increasing number of OSINT resources charge for the information on offer.

OSINT, unfortunately, is getting more difficult to obtain. Examples include China’s cut offs of technology information and the loss of shipping and train information from Ukraine. And there are more choke points; for example, Iran and North Korea. This means that OSINT is likely to require more effort than previously. The mix of machine and human work is changing. Consequently more informed and substantive information about OSINT will be required in 2022. The OSINT for amateurs approach is an outdated game.

Coker and Arnold are playing a new game.

Stephen E Arnold, January 13, 2022

Written by Stephen E. Arnold · Filed Under News, Open source, OSINT, Reference tool | Comments Off on OSINT for Amateurs

Disrupting Commercial Sci-Tech Indexes

November 10, 2021

Pooling knowledge is beneficial for advancing research. Despite the availability of digital databases on the Internet, these individual databases are not connected. Nature shares that an American technologist created a “Giant, Free Index To World’s Research Papers Released Online.”

Carl Malamud designed an online index that catalogs words and short phrases from over one hundred million journal articles, including paywalled papers. Malamud released the index under his California non-profit Public Resource. The index is free and its purpose is to help scientists discover insights from all research, even if stuck behind paywalls. Technically Malamud does not have the legal right to index the paywalled articles. However, the index only contains short phrases, up to five words long, from the paywalled articles. It does not violate copyright. Publishers may still argue that the index is a violation.
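For readers who want a concrete picture of what an index of words and short phrases looks like, here is a minimal sketch in Python. It is not Malamud's code. The function names, the sample articles, and the five-word limit (`max_n=5`) are assumptions made for illustration only.

```python
import re
from collections import Counter


def ngrams(text, max_n=5):
    """Yield every run of 1..max_n consecutive words in text (max_n is an assumed limit)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def build_index(articles, max_n=5):
    """Map each n-gram to a Counter of article ID -> number of occurrences."""
    index = {}
    for article_id, text in articles.items():
        for gram in ngrams(text, max_n):
            index.setdefault(gram, Counter())[article_id] += 1
    return index


# Hypothetical sample data, not taken from any real index.
articles = {
    "a1": "Text mining of gene names in the literature",
    "a2": "Gene names and drug names in open-access papers",
}
index = build_index(articles)
print(sorted(index["gene names"]))
```

The index stores only short word runs and the identifiers of the articles that contain them, so a researcher can ask which papers mention a phrase without the index holding any article's full text.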

The index is a major innovation:

“Malamud’s General Index, as he calls it, aims to address the problems faced by researchers such as Yadav. Computer scientists already text mine papers to build databases of genes, drugs and chemicals found in the literature, and to explore papers’ content faster than a human could read. But they often note that publishers ultimately control the speed and scope of their work, and that scientists are restricted to mining only open-access papers, or those articles they (or their institutions) have subscriptions to. Some publishers have said that researchers looking to mine the text of paywalled papers need their authorization.”

Some publishers, like Springer Nature, support open source development projects like the Malamud General Index. Springer Nature said open source projects do encounter problems when they do not secure proper rights.

Publishers do not have a case against Malamud. The index does not violate copyright and full text articles are not published in it. Instead the index pools a wealth of information and exposes paywalled articles to a larger audience, who will purchase content if it is helpful to research.

Publishers, however, may need convincing of this perspective.

Whitney Grace, November 10, 2021

Written by Stephen E. Arnold · Filed Under News, Publishing, Reference tool | 1 Comment

Yarchives: a Multi-Topic Repository of Information

October 5, 2021

Here is a useful resource, a repository of Usenet newsgroup articles collected and maintained by computer scientist Norman Yarvin. The Yarchive houses articles on twenty-two wide-ranging topics, from air conditioning to jokes to space. We note a couple that might be of interest to today’s assorted revolutionaries (or those tasked with countering them): explosives and nuclear technologies. Hmm. Perhaps there is a need to balance unfettered access to information with wisdom. The site’s About page reveals some details about Yarvin’s curation process. He writes:

“Articles are not put up here immediately; only a year or three after first saving them do I look at them again, sort them out, and make index pages for them. (By that time I’ve forgotten enough of them to make them worth rereading — and if I find they are not worth rereading, I discard them.) I’ve largely automated the making of index pages; the programs I’ve written for it (mostly in Perl) are available as a tar file (tools.tar). The making of the links to search for Google’s copy of each article is also automated. If it stops working because Google changed their query syntax, please let me know. Links that are on the Message-ID line of the header should link straight to the article in question; other links (from articles I’ve lost the Message-ID for) should invoke a search. For articles from the linux-kernel mailing list, links that are on the Original-Message-ID line of the header are to kernel.org’s copy of the article. (They used to be to GMANE, but that service went away.) Some changes have been made to these articles, but nothing that would destroy any possible meaning.”

The project seems to be quite the hobby for Yarvin. He goes on to describe the light corrections he makes, articles’ conversion to the UTF-8 character encoding, and his detailed process of checking the worthiness of URLs and making the valuable ones clickable.

Readers may want to peruse the Yarchive and/or bookmark it for future use. Information relevant to many of our readers can be found here, like files on computers, electronics, and security. More generally useful topics are also represented; cars, food, and houses, for example. Then there are the more specialized topics, like bicycles, chemistry, and metalworking. There is something here for everyone, it seems.

Cynthia Murrell, October 5, 2021

Written by Stephen E. Arnold · Filed Under News, Reference tool | Comments Off on Yarchives: a Multi-Topic Repository of Information

Free Resource on AI for Physical Simulations

September 27, 2021

The academics at the Thuerey Group have made a useful book on artificial intelligence operations and smart software applications available online. The Physics-Based Deep Learning Book is a comprehensive yet practical introduction to machine learning for physical simulations. Included are code examples presented via Jupyter notebooks. The book’s introduction includes this passage:

“People who are unfamiliar with DL methods often associate neural networks with black boxes, and see the training processes as something that is beyond the grasp of human understanding. However, these viewpoints typically stem from relying on hearsay and not dealing with the topic enough. Rather, the situation is a very common one in science: we are facing a new class of methods, and ‘all the gritty details’ are not yet fully worked out. However, this is pretty common for scientific advances. … Thus, it is important to be aware of the fact that – in a way – there is nothing magical or otherworldly to deep learning methods. They’re simply another set of numerical tools. That being said, they’re clearly fairly new, and right now definitely the most powerful set of tools we have for non-linear problems. Just because all the details aren’t fully worked out and nicely written up, that shouldn’t stop us from including these powerful methods in our numerical toolbox.”

This virtual tome would be a good place to start doing just that. Interested readers may want to begin studying it right away or bookmark it for later. Also see the Thuerey Group’s other publications for more information on numerical methods for deep-learning physics simulations.

Cynthia Murrell, September 27, 2021

Written by Stephen E. Arnold · Filed Under AI, News, Reference tool | Comments Off on Free Resource on AI for Physical Simulations

Simple Error for a Simple Link to the Simple Sabotage Field Manual

September 13, 2021

I love Silicon Valley type “real” news. I spotted a story called “The 16 Best Ways to Sabotage Your Organization’s Productivity, from a CIA Manual Published in 1944.” What’s interesting about this story is that the US government publication has been in circulation for many years. The write up states:

The “Simple Sabotage Field Manual,” declassified in 2008 and available on the CIA’s website, provided instructions for how everyday people could help the Allies weaken their country by reducing production in factories, offices, and transportation lines. “Some of the instructions seem outdated; others remain surprisingly relevant,” reads the current introduction on the CIA’s site. “Together they are a reminder of how easily productivity and order can be undermined.”

There’s one tiny flaw — well, two actually — in this Silicon Valley type “real” news report.

First, the url provided in the source document is incorrect. To download the document, navigate to this page or use this explicit link: https://www.hsdl.org/?view&did=750070. We verified both links at 0600, September 13, 2021.

And the second:

The write up did not include the time wasting potential of a Silicon Valley type publication providing incorrect information via a bad link. Mr. Donovan, the author of the document, noted on page 30:

Make mistakes in quantities of material when you are copying orders. Confuse similar names. Use wrong addresses.

Silly? Maybe just another productivity killer from the thumbtyping generation.

Stephen E Arnold, September 13, 2021

Written by Stephen E. Arnold · Filed Under News, Reference tool | Comments Off on Simple Error for a Simple Link to the Simple Sabotage Field Manual

The British Library Channels University Microfilms and the Google

September 1, 2021

While a quick Google search can yield pertinent information, it is hard to find. Why? Google search results are clogged with paid ads and Web sites that are not authoritative sources. Newspapers are still a valuable resource, especially newspapers from before the Internet’s invention. The brilliant news, as IanVisits shares, is that “The British Library Puts 1 Million Newspaper Pages Online For Free.”

The British Newspaper Archive contains over forty-four million newspaper pages that range from 1600-2009. The newspapers are from British and Irish sources and they are over 10% of the newspapers the British Library owns. Around half a million pages are added to the archive every month.

The newspapers currently require a subscription, but all funds go to scanning more pages for the archive. The British Newspaper Archive has released one million pages for free and plans to add another million over the next four years. Not all pages will be free, however:

“They won’t add all papers, as they say that while they consider newspapers made before 1881 to be in the public domain, that does not mean that will make all pre-1881 digitized titles available for free, as the archive is dependent on subscriptions to cover its costs. If like me you do a lot of historical research, then the cost of the full subscription is not that bad – just £80 a year for the full archive.”

The archive offers 158 free newspaper titles that range from 1720-1880. All of the newspapers that fall within this date range are in the public domain.

It would be awesome if all newspapers were available for free on the Internet, but money makes the world go round. Libraries and universities offer free access to newspaper databases and subscription services, in most cases, are not that expensive.

The good news is that researchers may have access to news stories infused with some of that good old “real” journalistic wire tapping.

Whitney Grace, September 1, 2021

Written by Stephen E. Arnold · Filed Under News, Reference tool | Comments Off on The British Library Channels University Microfilms and the Google

The Internet Archive Dons a Scholar Skin

April 23, 2021

Some of today’s biggest social faux pas are believing everything on the Internet, clicking the first link in search results, and buying items from questionable Internet ads. It is easy to forget that search engines like Google and Bing are for-profit businesses that put paid links at the top of search results. What is even worse is that scientific and scholarly information is locked behind expensive paywalls.

Wikipedia is often believed to be a reliable source, but despite the dedication of wiki editors the encyclopedia is not 100% accurate. There are free scholarly databases and newspapers often have their archives online, but that information is not widely known.

Thankfully the Internet Archive is fairly famous. The Internet Archive is a non-profit digital library that provides users with access to millions of free books, music, Web sites, videos, and software. They also allow users to peruse old Web sites with the Wayback Machine.

The Internet Archive recently introduced a brand new service that is sheer genius: Internet Archive Scholar. It is described as:

“This full text search index includes over 25 million research articles and other scholarly documents preserved in the Internet Archive. The collection spans from digitized copies of eighteenth century journals through the latest Open Access conference proceedings and pre-prints crawled from the World Wide Web.”

Why did no one at the Internet Archive think of doing this before? It is a brilliant idea that localizes millions of scholarly articles and other information without paywalls, university matriculation, or a library card. Most of the information available through the Internet Archive Scholar would otherwise remain buried in Google search results or on the Web, like old books gathering dust on library shelves.

Internet Archive Scholar is still in the beta phase and enhancements are a positive step.

Whitney Grace, April 23, 2021

Written by Stephen E. Arnold · Filed Under News, Reference tool, Search | Comments Off on The Internet Archive Dons a Scholar Skin

IA Scholar: A Reminder That Existing Online Resources Are Not Comprehensive

March 10, 2021

We spotted this announcement from the Internet Archive in “Search Scholarly Materials Preserved in the Internet Archive.”

IA Scholar is a simple, access-oriented interface to content identified across several Internet Archive collections, including web archives, archive.org files, and digitized print materials. The full text of articles is searchable for users that are hunting for particular phrases or keywords. This complements our existing full-text search index of millions of digitized books and other documents on archive.org. The service builds on Fatcat, an open catalog we have developed to identify at-risk and web-published open scholarly outputs that can benefit from long-term preservation, additional metadata, and perpetual access. Fatcat includes resources that may be useful to librarians and archivists, such as bulk metadata dumps, a read/write API, command-line tool, and file-level archival metadata. If you are interested in collaborating with us, or are a researcher interested in text analysis applications, we have a public chat channel or can be contacted by email at info@archive.org.

I ran several queries. The system is set up to respond to a conference name, but free text entries worked fine; for example, NLP. Here are the results:

Worth checking out. In my experience people who are “experts” in online often forget that no online service is up to date, comprehensive, and set up to deliver full text. One other point: Corrections to online content are rarely, if ever, made. Business Dateline, produced by the Courier Journal and Louisville Times in the early 1980s, was one of the first commercial databases to include corrections. Thumbtypers may not care, but that’s the zippy modern world.

Stephen E Arnold, March 10, 2021

Written by Stephen E. Arnold · Filed Under News, Online (general), Reference tool | Comments Off on IA Scholar: A Reminder That Existing Online Resources Are Not Comprehensive

Comments about Web Search: Prompted by a Hacker News Thread

November 13, 2020

I spotted a Web search related thread on Hacker News. You can locate the comments at this link. Several observations:

Metasearch. Confusion seems to exist between a dedicated Web search system like Bing, Google, and Yandex and metasearch systems like DuckDuckGo and Startpage. Dedicated Web search systems require considerable effort, but there is less appreciation for the depth of the crawl, the index updating cycle, and similar factors.
Competitors to Google. The comments present a list of search systems which are relatively well known. Omitted are some other services; for example, iSeek, Swisscows, and 50kft.
Bias. The comments do not highlight some of the biases of Web search systems; for example, when are pages reindexed, what pages are on a slow or never update cycle, blacklisted, or processed against a stop word list.

So what?

Many profess to be experts at finding information online. The comments suggest that perception is different from reality.
Locating content on publicly accessible Web sites is more difficult than at any other time in my professional career in the online information sector.
Locating relevant information is increasingly time consuming because predictive, personalized, and wisdom of crowd results don’t work; for example, run this query on any of the search engines:

Voyager search

Did your results point to the Voyager Labs’s system, the UK HR company’s search engine, a venture capital firm, or a Lucene repackager in Orange County? What about Voyager patents? What about Voyager customers?

How can one disambiguate when the index scope is unknown, entity extraction is almost non existent, and deduplication almost laughable? Real time? Ho ho ho.

One can do this work manually. Who wants to volunteer for that? The most innovative specialized search vendors try to automate the process. Some of these systems are helpful; most are not.

Is search getting better? Rerun that Voyager search. See for yourself.

Without field codes, Boolean, and a mechanism to search across publicly accessible content domains, Web search reveals its shortcomings to those who care to look.
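To make the field code and Boolean point concrete, here is a minimal sketch in Python of how a fielded Boolean filter can pull apart the different “Voyager” results. The records, the field names (`title`, `type`), and the `matches` helper are hypothetical. They illustrate the idea and do not describe how any Web search system works.

```python
def matches(record, must=(), must_not=()):
    """True if every (field, term) in must is present and none in must_not is."""
    def has(field, term):
        # Case-insensitive substring test against one named field.
        return term.lower() in record.get(field, "").lower()

    return (all(has(f, t) for f, t in must)
            and not any(has(f, t) for f, t in must_not))


# Hypothetical result records with explicit fields.
records = [
    {"title": "Voyager Labs analytics", "type": "company"},
    {"title": "Voyager search patent filing", "type": "patent"},
    {"title": "Voyager venture capital fund", "type": "company"},
]

# title:voyager AND type:patent
patents = [r["title"] for r in records
           if matches(r, must=[("title", "voyager"), ("type", "patent")])]

# title:voyager NOT title:venture
not_vc = [r["title"] for r in records
          if matches(r, must=[("title", "voyager")],
                     must_not=[("title", "venture")])]

print(patents)
print(not_vc)
```

With fields and Boolean operators, the searcher decides which Voyager is meant. Without them, the ranking algorithm decides.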

Not many look, including professionals at some of the better known Web search outfits.

Stephen E Arnold, November 13, 2020

Written by Stephen E. Arnold · Filed Under News, Reference tool, Search | Comments Off on Comments about Web Search: Prompted by a Hacker News Thread

Science: Just Delete It

September 10, 2020

The information in “Dozens of Scientific Journals Have Vanished from the Internet, and No One Preserved Them” may remind some people that the “world’s information” and the “Internet archives” are marketing sizzle. The steak is the source document. The FBI has used the phrase “going dark” as shorthand for not being able to access certain information. The thrill of not having potentially useful information is one that most researchers prefer to reserve for thrill rides at Legoland.

The write up states:

Eighty-four online-only, open-access (OA) journals in the sciences, and nearly 100 more in the social sciences and humanities, have disappeared from the internet over the past 2 decades as publishers stopped maintaining them, potentially depriving scholars of useful research findings, a study has found. An additional 900 journals published only online also may be at risk of vanishing because they are inactive, says a preprint posted on 3 September on the arXiv server. The number of OA journals tripled from 2009 to 2019, and on average the vanished titles operated for nearly 10 years before going dark, which “might imply that a large number … is yet to vanish…

Flat earthers and those who believe that “just being” is a substitute for academic rigor are probably going to have a “thank goodness, these documents are gone” party. I won’t be attending.

Anti-intellectualism is really exciting. Plus, it makes life a lot easier for those in the top one percent of intellectual capability. Why? Extensive reading can fill in some blanks. Who wants to be comprehensive? Oh, I know: “Those who consume TikTok videos and devour Instagram while checking WhatsApp messages.”

Stephen E Arnold, September 10, 2020

Written by Stephen E. Arnold · Filed Under News, Publishing, Reference tool | Comments Off on Science: Just Delete It


Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.
