I am using pandas for my data cleaning and data transformation work.
Currently I am working with a single DataFrame with more than 80K records (rows).
My requirement is to make a GET API call for every row, taking the parameter from one column and storing the result in another column of the same row.
My problem is time. I ran my function logic that calls the API, and it works as I expected.
If I pass around 5–10 records the result is almost instant, but around 50 records takes 10–20 seconds.
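Simplified, the per-row logic is equivalent to this (the column names are illustrative, and the real endpoint and auth are described further down):

```python
import pandas as pd
import requests

df = pd.DataFrame({"company_name": ["ACME LTD", "GLOBEX LLC"]})  # illustrative data

def fetch_company(name: str) -> dict:
    # Stand-in for my real call; the actual endpoint, key and
    # parameter are given further down in the question.
    response = requests.get("https://api.example.com/search", params={"q": name}, timeout=10)
    response.raise_for_status()
    return response.json()

# One blocking HTTP request per row, so runtime grows linearly
# with the number of rows.
df["api_result"] = df["company_name"].apply(fetch_company)
```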
Now I need to do this for more than 80K records. What is the most efficient way to do that?
- Should I follow a batch-processing approach? If so, how do I figure out the right batch size? (A rough sketch of what I mean is after this list.)
- How can I figure out the API server's capacity or its allowed rate limit?
- I was thinking of building a pipeline with Apache NiFi or Apache Airflow, but I have no experience with either, so I have not been able to put one together.
- Is there any other approach to solve this problem?
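For the batch idea, this is roughly what I have in mind; the batch size of 50 and the one-second pause are pure guesses, which is exactly the part I don't know how to choose:

```python
import time
import pandas as pd

BATCH_SIZE = 50   # guess -- this is the number I don't know how to pick
PAUSE_SECS = 1.0  # guess -- likewise, depends on the API's rate limit

df = pd.DataFrame({"company_name": [f"company {i}" for i in range(200)]})  # illustrative

def fetch_company(name: str) -> dict:
    # Stand-in for the real per-row API call shown in this question.
    return {"searched": name}

results = []
for start in range(0, len(df), BATCH_SIZE):
    batch = df.iloc[start:start + BATCH_SIZE]
    results.extend(fetch_company(name) for name in batch["company_name"])
    time.sleep(PAUSE_SECS)  # back off between batches to stay under the limit

df["api_result"] = results
```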
The base URL for the API call is: https://api.company-information.service.gov.uk/advanced-search/companies
It requires an API key, plus a query parameter `company_name_includes` whose value is the company name.
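For reference, a single call looks like this (the key value is a placeholder; as far as I can tell from the Companies House docs, the key goes in as the HTTP Basic auth username with a blank password):

```python
import requests

API_KEY = "my-api-key"  # placeholder, the real key comes from my account
url = "https://api.company-information.service.gov.uk/advanced-search/companies"

response = requests.get(
    url,
    params={"company_name_includes": "ACME"},  # value comes from my DataFrame column
    auth=(API_KEY, ""),  # key as Basic auth username, blank password
    timeout=10,
)
response.raise_for_status()
data = response.json()  # this is what I store back into the result column
```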
I tried calling the GET endpoint for all the records at once. I also tried simple parallel processing in Python; a simplified version of that attempt is below.
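The parallel attempt looked roughly like this (simplified; the worker count of 10 was an arbitrary choice):

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import requests

API_KEY = "my-api-key"  # placeholder
URL = "https://api.company-information.service.gov.uk/advanced-search/companies"

def fetch_company(name: str) -> dict:
    response = requests.get(
        URL,
        params={"company_name_includes": name},
        auth=(API_KEY, ""),
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

df = pd.DataFrame({"company_name": ["ACME LTD", "GLOBEX LLC"]})  # illustrative

# executor.map preserves input order, so the results line up
# with the original rows.
with ThreadPoolExecutor(max_workers=10) as executor:
    df["api_result"] = list(executor.map(fetch_company, df["company_name"]))
```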
It still did not get me the results in a reasonable time. I am hoping for a more optimal approach.