i am trying to perform data migration from HDFS (Cloudera Impala) to Vertica database.
Is there any way to speed up the data transfer? I have seen that CDataSync might be a suitable tool but i cannot figure out how to connect because when i connect with VPN (FortiClient) to the company’s network, i have no internet public access.
Thank you in advance.
I connect in CDSWorkbench after i connect through VPN using FortiClient in order to retrieve the data. I download chunk of data locally and then after i connect with CyberArk in a remote host, i transfer the data there and i import them in Vertica running a unix script.
The amount of data is enormous and obviously this is not the most efficient way to transfer data quickly.
Vertica supports loading from Parquet, ORC, Avro, CSV, JSON into a waiting table, just using the COPY command.
Vertica also something like this:
CREATE EXTERNAL TABLE fromhdfs (
id INTEGER
, wname VARCHAR(32)
)
AS COPY FROM 'hdfs:///mydirectory/ofmanyfiles/fromhdfs*';
And this type of table can be queried like a normal table – albeit more slowly than an internal Vertica table.
So try to use that possibility to read directly from the underlying files to Impala if you have file formats that Vertica supports, instead of involving Impala for the data movement.
Check Vertica’s documentation:
https://docs.vertica.com/24.3.x/en/ –
And search for the HDFS keyword, and for the COPY command, and for the CREATE EXTERNAL TABLE command.
And check out for the Vertica function that can generate a CREATE TABLE
command out of the supported file formats: INFER_TABLE_DDL()
:
https://docs.vertica.com/24.3.x/en/sql-reference/functions/management-functions/table-functions/infer-table-ddl/