I’m working on my first small data project to add to my portfolio.
I’m starting to understand the ETL and ELT processes, and I’m using Prefect for orchestration, Airbyte for moving my data, and dbt for transformation.
As a first test, I wrote a quick small script to retrieve data from IMDb (to check whether my script can send data to my Postgres database running in a Docker container on an EC2 instance).
The script took 22 minutes to run.
That’s quite long ^^’
Here is the script:

    from prefect import flow
    import pandas as pd
    from prefect_sqlalchemy import DatabaseCredentials

    @flow
    def extract_imdb():
        credentials = DatabaseCredentials.load("imdb-postgres", validate=False)
        engine = credentials.get_engine()
        url = "https://datasets.imdbws.com/name.basics.tsv.gz"
        # separator must be '\t' (tab); IMDb datasets use '\N' for nulls
        df = pd.read_csv(url, sep='\t', compression='gzip', na_values='\\N')
        df.to_sql("name_basics", engine, if_exists='replace', index=False)

    extract_imdb()
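One thing I’ve been considering (untested, just a sketch): reading the file in chunks so the whole dataset isn’t held in memory and written in a single giant `to_sql` call. The `load_tsv_gz` helper name and the chunk sizes below are my own assumptions, not something from a tutorial:

```python
import pandas as pd

def load_tsv_gz(source, table, engine, chunksize=100_000):
    """Stream a gzipped TSV into a table chunk by chunk.

    `source` can be a URL or a local path (pandas handles both).
    The first chunk replaces the table; later chunks append.
    """
    first = True
    for chunk in pd.read_csv(source, sep="\t", compression="gzip",
                             chunksize=chunksize, na_values="\\N"):
        chunk.to_sql(table, engine,
                     if_exists="replace" if first else "append",
                     index=False,
                     method="multi",   # batch many rows per INSERT
                     chunksize=10_000)
        first = False
```

From what I’ve read, for Postgres specifically the `COPY` command (e.g. via psycopg2’s `copy_expert`) is usually much faster than row INSERTs, so this chunked `to_sql` is probably still not the fastest option.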
22 minutes is quite long, and I guess my way of doing it isn’t optimal.
Do you have any tips or library suggestions?
I saw some people loading the data locally first; is that the right approach?
Do you know a place where I can browse other people’s projects?
Unfortunately I haven’t found much on the web…
I’d be very happy to learn from you 🙂