I'm using the code shown below to retrieve papers from arXiv. I want to retrieve papers that have the words "machine" and "learning" in the title. The number of matching papers is large, so I want to slice the requests by year (the published date).

How can I request only the records from 2019 and 2020 in search_query? Please note that I'm not interested in post-filtering.
import urllib.request
import urllib.parse
import time
import feedparser

# Base API query URL
base_url = 'http://export.arxiv.org/api/query?'

# Search parameters
search_query = urllib.parse.quote("ti:machine learning")
start = 0
total_results = 5000
results_per_iteration = 1000
wait_time = 3
papers = []

print('Searching arXiv for %s' % search_query)

for i in range(start, total_results, results_per_iteration):
    print("Results %i - %i" % (i, i + results_per_iteration))
    query = 'search_query=%s&start=%i&max_results=%i' % (search_query, i, results_per_iteration)

    # Perform a GET request using the base_url and query
    response = urllib.request.urlopen(base_url + query).read()

    # Parse the response using feedparser
    feed = feedparser.parse(response)

    # Run through each entry and collect its metadata
    # (note: feedparser only grabs the first author via entry.author)
    for entry in feed.entries:
        paper = {}
        paper["date"] = entry.published
        paper["title"] = entry.title
        paper["first_author"] = entry.author
        paper["summary"] = entry.summary
        papers.append(paper)

    # Sleep a bit before calling the API again
    time.sleep(wait_time)
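For reference, here is a minimal sketch of how I imagine the query string could be extended, assuming the arXiv API accepts a `submittedDate:[FROM TO TO]` range filter combined with the title terms via `AND` (the exact field name and timestamp format are my assumption from the API docs, not something I've verified against large result sets):

```python
import urllib.parse

def build_query(year_start, year_end, start=0, max_results=1000):
    # Hypothetical helper: restrict results to papers submitted between
    # Jan 1 of year_start and Dec 31 of year_end (timestamps as YYYYMMDDHHMM).
    raw = ('ti:"machine learning" AND '
           'submittedDate:[%d01010000 TO %d12312359]' % (year_start, year_end))
    return 'search_query=%s&start=%i&max_results=%i' % (
        urllib.parse.quote(raw), start, max_results)

print(build_query(2019, 2020))
```

If this syntax is valid, the resulting string would replace the `query` built inside the loop above.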