Backstory: I have 12 zip files in ADLS Gen2 storage, each around 300 MB.
I am running a notebook within a pipeline job.
It goes smoothly until extracting the 6th zip file using pd.read_csv(compression='zip').
Around the 27 min 57 sec mark, the token expires.
So I opened the Synapse notebook and ran it interactively, and it fails with:
ClientAuthenticationError: Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
ErrorCode:InvalidAuthenticationInfo
Authenticationerrordetail: Lifetime validation failed. The token is expired.
When I run the pipeline, it fails at around the same position.
I am using the bare-minimum setup, as my company doesn't have the budget for increasing nodes, etc. Microsoft suggests applying a retry-upon-failure policy, and also mentions Synapse's inability to handle token refreshes for non-user identities.
Is there a workaround for this? Or are there screenshots that can assist me? Thanks!
In Synapse you have the options below to authenticate: using a linked service, or using storage_options.
- Create a linked service to the ADLS Gen2 account and use it while reading the files. You can use a system-assigned or user-assigned managed identity for authentication.
code:
import pandas as pd

# The linked service handles authentication, so no manual token management is needed
df = pd.read_csv('abfs://<container_name>@<storage_acc_name>.dfs.core.windows.net/<path>/parse1_data_preview.csv',
                 storage_options={'linked_service': '<linked_service_name>'})
df
This doesn't expire; the linked service takes care of refreshing credentials for you.
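For your 12 zipped files, the same linked-service read can be wrapped in a loop. Below is a minimal sketch; the container, folder, file names and linked service name are placeholders for your own values.
code:
import pandas as pd

# Placeholder paths to the 12 zip files; replace with your actual ADLS Gen2 paths
zip_paths = [
    f'abfs://<container_name>@<storage_acc_name>.dfs.core.windows.net/<folder>/file_{i}.zip'
    for i in range(1, 13)
]

dfs = []
for path in zip_paths:
    # The linked service authenticates every read, so a long-running loop
    # is not affected by token expiry
    dfs.append(pd.read_csv(path, compression='zip',
                           storage_options={'linked_service': '<linked_service_name>'}))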
- With storage_options, you can pass the credentials below.
code:
import pandas

# Read the data file (note: the file path must not contain a stray space)
df = pandas.read_csv('abfs://file_system_name@account_name.dfs.core.windows.net/file_path',
                     storage_options={'account_key': 'account_key_value'})
## or storage_options = {'sas_token' : 'sas_token_value'}
## or storage_options = {'connection_string' : 'connection_string_value'}
## or storage_options = {'tenant_id': 'tenant_id_value', 'client_id' : 'client_id_value', 'client_secret': 'client_secret_value'}
Here, I recommend using tenant_id, client_id, and client_secret (a service principal).
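For example, a sketch using a service principal for one of your zipped files (the values are placeholders; in practice, retrieve the client secret from Azure Key Vault rather than hard-coding it in the notebook):
code:
import pandas as pd

# Service principal credentials (placeholders); keep the secret in Azure Key Vault
sp_options = {
    'tenant_id': '<tenant_id_value>',
    'client_id': '<client_id_value>',
    'client_secret': '<client_secret_value>'
}

df = pd.read_csv('abfs://<container_name>@<storage_acc_name>.dfs.core.windows.net/<path>/file_1.zip',
                 compression='zip', storage_options=sp_options)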
Refer to the documentation below for more information.
Tutorial: Use Pandas to read/write ADLS data in serverless Apache Spark pool in Synapse Analytics – Azure Synapse Analytics | Microsoft Learn
Tutorial: Use FSSPEC to read/write ADLS data in serverless Apache Spark pool in Synapse Analytics – Azure Synapse Analytics | Microsoft Learn
If you still want to use tokens directly, you need to keep checking their expiry.
Below is the logic; alter it according to your requirements.
import time
import pandas as pd
from azure.identity import DefaultAzureCredential

# Assumes an azure.identity credential; swap in whichever credential type you actually use
credential = DefaultAzureCredential()

token_expiry_buffer = 300  # refresh 5 minutes before the real expiry
token_expiry_time = 0

def get_token():
    # Fetch a new storage token and remember when it expires
    global token_expiry_time
    token = credential.get_token("https://storage.azure.com/.default")
    token_expiry_time = token.expires_on
    return token

def is_token_expired():
    current_time = time.time()
    return current_time >= (token_expiry_time - token_expiry_buffer)

paths = ["zip1path", "zip2path", "zip3path"]  # ... add the remaining zip file paths

for path in paths:
    if is_token_expired():  # call the function; don't test the function object itself
        # refresh the token
        get_token()
    df = pd.read_csv(path)  # pass storage_options here as shown above when reading from abfs://
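Since the failure surfaces as a ClientAuthenticationError, you can also combine this with the retry-upon-failure approach Microsoft suggested. Below is a minimal sketch; read_csv_with_retry, max_retries and backoff_seconds are just illustrative names, not a built-in API.
import time
import pandas as pd
from azure.core.exceptions import ClientAuthenticationError

def read_csv_with_retry(path, storage_options, max_retries=3, backoff_seconds=30):
    # Retry a read that fails with an expired-token authentication error
    for attempt in range(1, max_retries + 1):
        try:
            return pd.read_csv(path, compression='zip', storage_options=storage_options)
        except ClientAuthenticationError:
            if attempt == max_retries:
                raise
            time.sleep(backoff_seconds)  # wait before retrying with refreshed credentials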