August 7, 2020

“Leveraging ML to Fuel New Discoveries with the ArXiv Dataset” announces that more than 1.7 million journal-type papers are available without charge on Kaggle. DarkCyber learned:

To help make the ArXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable ArXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

What’s Kaggle? The article explains:

Kaggle is a destination for data scientists and machine learning engineers seeking interesting datasets, public notebooks, and competitions. Researchers can utilize Kaggle’s extensive data exploration tools and easily share their relevant scripts and output with others.

The ArXiv dataset contains metadata for each processed paper (document), including these fields:

  • ID: ArXiv ID (can be used to access the paper, see below)
  • Submitter: Who submitted the paper
  • Authors: Authors of the paper
  • Title: Title of the paper
  • Comments: Additional info, such as number of pages and figures
  • Journal-ref: Information about the journal the paper was published in
  • DOI: [Digital Object Identifier](https://www.doi.org)
  • Abstract: The abstract of the paper
  • Categories: Categories / tags in the ArXiv system
  • Versions: A version history
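For readers who want to see what working with these fields might look like, here is a minimal Python sketch. It assumes the metadata ships as one JSON object per line and that the keys are lowercase versions of the field names listed above; neither detail is confirmed in the article. The sample record is invented for illustration.

```python
import io
import json

# Assumed layout: one JSON object per line, with lowercase keys that
# mirror the field list above. The sample record is made up.
SAMPLE = io.StringIO(
    '{"id": "0000.00001", "submitter": "A. Submitter", '
    '"authors": "A. Author, B. Author", "title": "An Example Title", '
    '"comments": "10 pages, 2 figures", "journal-ref": null, "doi": null, '
    '"abstract": "A placeholder abstract.", "categories": "cs.IR cs.LG", '
    '"versions": [{"version": "v1"}]}\n'
)


def iter_records(lines):
    """Yield one dict per non-empty line of JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(record):
    """Pick out a few fields; .get() tolerates missing or null values."""
    return {
        "id": record.get("id"),
        "title": record.get("title"),
        "categories": (record.get("categories") or "").split(),
        "n_versions": len(record.get("versions") or []),
    }


records = [summarize(r) for r in iter_records(SAMPLE)]
```

Reading line by line keeps memory use flat, which matters if the real file is large. To run this against a downloaded file, pass an open file handle to `iter_records` in place of `SAMPLE`.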

Details about the data and their location appear at this link. You can use the ArXiv ID to download a paper.
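One way to turn an ArXiv ID into something clickable is sketched below. The URL patterns are an assumption based on how arxiv.org addresses its abstract and PDF pages, not something the article spells out, and the ID shown is a placeholder rather than a real paper.

```python
def arxiv_urls(arxiv_id):
    """Return assumed abstract-page and PDF URLs for an ArXiv ID."""
    arxiv_id = arxiv_id.strip()
    if not arxiv_id:
        raise ValueError("empty ArXiv ID")
    return {
        # Assumed URL patterns; verify against the dataset documentation.
        "abs": f"https://arxiv.org/abs/{arxiv_id}",
        "pdf": f"https://arxiv.org/pdf/{arxiv_id}",
    }


# Placeholder ID for illustration only.
urls = arxiv_urls(" 0000.00001 ")
```

The function only builds strings. Fetching the PDF is left to whatever download tool you prefer.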

What if you want to search the collection? You may want to download the terabyte-plus file and index the JSON using your favorite search utility. ArXiv also offers its own search system, and you can use the site: operator on Bing or Google to see if one of those ad-supported services will point you to the document set you need.
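If you go the do-it-yourself route, the core idea behind most search utilities is an inverted index. The sketch below is a toy version in Python, not a substitute for a real search engine. It assumes each record is a dict with `id`, `title`, and `abstract` keys, and the two sample records are invented.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN.findall(text.lower())


def build_index(records):
    """Map each term in the title or abstract to the set of record IDs."""
    index = defaultdict(set)
    for rec in records:
        text = f"{rec.get('title', '')} {rec.get('abstract', '')}"
        for term in tokenize(text):
            index[term].add(rec["id"])
    return index


def search(index, query):
    """Return IDs of records containing every term in the query (AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Invented records for illustration only.
RECORDS = [
    {"id": "0000.00001", "title": "Neural ranking for search",
     "abstract": "We study ranking."},
    {"id": "0000.00002", "title": "Graph search methods",
     "abstract": "We study graphs."},
]
index = build_index(RECORDS)
hits = search(index, "search ranking")
```

This version keeps everything in memory and does no relevance scoring, so it only illustrates the mechanism. For 1.7 million records, a dedicated search system is the more realistic choice.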

DarkCyber wants to suggest that you download the corpus now (datasets can go missing) and use your favorite search and retrieval system or content processing system to locate and make sense of the ArXiv content objects.

Stephen E Arnold, August 7, 2020


