I have more than 50 tables spread across different databases in Hive, and I need to retrieve the record counts for each table on a daily basis.
To manage this, I already have a table in Hive called p_tables that contains the database names and table names in separate columns, database_name and tab_name.
Rather than writing the SQL query manually for each of these 50+ tables, I am trying to automate the process using Hue. My idea is to create a Hive script that can be parameterized and then used within a Workflow. The script I came up with looks like this:
INSERT INTO db.table
SELECT
"parameter1" AS `database_name`,
"parameter2" AS `tab_name`,
count(*) AS `measurement`,
CURRENT_DATE AS `p_date`
FROM parameter1.parameter2;
In this script, parameter1 would be the database_name and parameter2 would be the tab_name. The plan is to pass these as parameters in the Workflow.
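To make the idea concrete, here is a minimal sketch of how the script might look if the placeholders were written with Hive's variable substitution syntax. It assumes substitution is enabled in the environment and that the variables are supplied from outside the script; the variable names database_name and tab_name are placeholders chosen for illustration, and db.table is still the stand-in target from the script above.

```sql
-- Sketch only: assumes Hive variable substitution (hivevar namespace) is enabled.
-- database_name and tab_name are hypothetical variables supplied on each run.
INSERT INTO db.table
SELECT
  '${hivevar:database_name}' AS `database_name`,  -- substituted as a string literal
  '${hivevar:tab_name}'      AS `tab_name`,       -- substituted as a string literal
  COUNT(*)                   AS `measurement`,
  CURRENT_DATE               AS `p_date`
FROM ${hivevar:database_name}.${hivevar:tab_name}; -- substituted as identifiers
```

Outside Hue, a script like this could be run once per table with something like beeline -f count_rows.hql --hivevar database_name=some_db --hivevar tab_name=some_table (the file name and values are made up, and connection options are omitted). Whether a Hue Workflow can feed the rows of p_tables into these variables is exactly the open question.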
I attempted to create a Hive script in Hue that uses parameters for the database_name and tab_name values. My expectation was that I could then integrate this script into a Workflow, where I would be able to pass the actual database and table names from the p_tables table as parameters.
However, I’m not sure how to properly set up these parameters in Hue, or if this approach is even feasible. I’m looking for guidance on whether this can be done, and if not, what alternative methods could be used to achieve the same result.