I am calling an API (rate-limited to 30 calls per minute, so I do not parallelize) from a PySpark notebook in Azure Synapse Analytics, and it takes ages there while it is lightning fast in my local environment (Python 3 kernel, Jupyter Notebook).
Being quite new to PySpark, I wonder which operation(s) take so much time.
# Request functions
import requests
import time

# First function
def get_response(url):
    # INSEE_KEY is defined elsewhere in the notebook
    headers = {'Authorization': f'Bearer {INSEE_KEY}'}
    response = requests.get(url, headers=headers)
    return response
# Second function using the first one
def get_data_by_siret(siret):
    # base_url is defined elsewhere in the notebook
    siret_url = f'{base_url}siret?q=siret:{siret}'
    response = get_response(siret_url)
    if response.status_code == 200:
        content = response.json()
        siret_data = content['etablissements']
        print(f'Siret: {siret} collected. (timestamp: {time.time()})')
        return siret_data, 200
    elif response.status_code == 429:
        print(f'Siret: {siret}; quota reached. (timestamp: {time.time()})')
        return [], 429
    elif response.status_code == 404:
        print(f'Siret {siret} unknown in the Sirene database')
        return [], 404
    else:
        print(f'Siret {siret}. {response.status_code} (timestamp: {time.time()})')
        return [], response.status_code
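To pin down where the time goes, one option is to time the HTTP round trip separately from the whole call: requests records the server round trip in response.elapsed. Here is a minimal diagnostic sketch, assuming the same INSEE_KEY as above (timed_get_response is a hypothetical helper, not part of the original code):

import time
import requests

def timed_get_response(url):
    # Compare the HTTP round trip measured by requests (response.elapsed,
    # i.e. until the response headers arrive) with the wall-clock time
    # of the whole call.
    headers = {'Authorization': f'Bearer {INSEE_KEY}'}
    start = time.perf_counter()
    response = requests.get(url, headers=headers)
    wall = time.perf_counter() - start
    print(f'HTTP round trip: {response.elapsed.total_seconds():.2f}s, '
          f'whole call: {wall:.2f}s')
    return response

If response.elapsed stays close to what I see locally while the whole call takes ~30 s, the time is being lost outside the request itself; if response.elapsed is itself ~30 s, the bottleneck would be the network path from the Spark pool to the API.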
Here is the very slow part.
# Collecting Sirets data
sirets_json = []
timestamps = []
siret_429 = []

for siret in array_siret:
    time.sleep(1)
    siret_data, status = get_data_by_siret(siret)
    timestamps.append(time.time())
    if status == 200:
        sirets_json.extend(siret_data)
    elif status == 429:
        time.sleep(1)
        siret_429.append(siret)
By ‘very slow’ I mean this:
Siret: 04555025800029 collected. (timestamp: 1720795818.7682683) 'Fri Jul 12 14:50:18 2024'
Siret: 05780615000017 collected. (timestamp: 1720795851.364306) 'Fri Jul 12 14:50:51 2024'
Siret: 09572031400475 collected. (timestamp: 1720795883.9143667) 'Fri Jul 12 14:51:23 2024'
In my local environment, with the exact same code, I can collect 60 sirets in about 75 seconds (roughly 1.25 s per call), whereas the timestamps above show each call taking about 32 seconds in Synapse.
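For comparison, a stripped-down loop that reuses a single requests.Session (one TCP/TLS handshake for all calls instead of one per call) could help rule connection setup in or out. This is only a sketch under the same assumptions (base_url, INSEE_KEY and array_siret defined as above):

import time
import requests

session = requests.Session()
session.headers.update({'Authorization': f'Bearer {INSEE_KEY}'})

# A small sample is enough to compare per-call latency with the logs above.
for siret in array_siret[:5]:
    start = time.perf_counter()
    r = session.get(f'{base_url}siret?q=siret:{siret}')
    print(f'{siret}: {r.status_code} in {time.perf_counter() - start:.2f}s')
    time.sleep(1)

If the per-call time drops toward the ~1.25 s I see locally, the overhead was per-connection setup; if it stays around 30 s, something else in the Synapse environment is responsible.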