I’m currently using Trino SQL to read and join files from different MinIO endpoints. However, I’m facing an issue with the following setup (all installed using Docker):
- Trino: version 447
  - Configured minio1.properties and minio2.properties, both using the same Hive Metastore.
- Hive Metastore: version 3.3.1 (using PostgreSQL)
  - hive-site.xml is configured to connect to MinIO1.
  - MinIO1 is set up to avoid access-key issues when trying to read MinIO data using Trino SQL.
- MinIO: 2 containers
  - MinIO1 contains bucket1 with file /bucket1/data/a.csv.
  - MinIO2 contains bucket2 with file /bucket2/data/b.csv.
hive-site.xml for Hive Metastore:
<configuration>
  ...
  <property>
    <name>fs.s3a.access.key</name>
    <value><access-key-1></value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value><secret-key-1></value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://minio-1:9000</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>
Trino Catalogs:
# minio1.properties
connector.name=hive
fs.native-s3.enabled=true
hive.metastore.uri=thrift://hive-metastore:9083
hive.non-managed-table-writes-enabled=true
s3.aws-access-key=<access-key-1>
s3.aws-secret-key=<secret-key-1>
s3.endpoint=http://minio-1:9000
s3.max-connections=3
s3.path-style-access=true
s3.region=default
s3.streaming.part-size=32MB
s3.tcp-keep-alive=false
# minio2.properties
connector.name=hive
fs.native-s3.enabled=true
hive.metastore.uri=thrift://hive-metastore:9083
hive.non-managed-table-writes-enabled=true
s3.aws-access-key=<access-key-2>
s3.aws-secret-key=<secret-key-2>
s3.endpoint=http://minio-2:9000
s3.max-connections=3
s3.path-style-access=true
s3.region=default
s3.streaming.part-size=32MB
s3.tcp-keep-alive=false
Trino SQL:
CREATE TABLE minio1.default.data1 (
    a VARCHAR,
    b VARCHAR
) WITH (
    format = 'CSV',
    external_location = 's3a://bucket1/data/a.csv'
);

CREATE TABLE minio2.default.data2 (
    a VARCHAR,
    b VARCHAR
) WITH (
    format = 'CSV',
    external_location = 's3a://bucket2/data/b.csv'
);
Issue:
The first SQL query executes successfully, but the second one fails with the following error:
Got exception: org.apache.hadoop.fs.s3a.UnknownStoreException `s3a://bucket2/data/b.csv':
getFileStatus on s3a://bucket2/data/b.csv:
com.amazonaws.services.s3.model.AmazonS3Exception:
The specified bucket does not exist (Service: Amazon S3; Status Code: 404; Error Code: NoSuchBucket; Request ID: aaa; S3 Extended Request ID: 711af550-b344-4a8e-a564-c7362c2ff38b; Proxy: null), S3 Extended Request ID: 711af550-b344-4a8e-a564-c7362c2ff38b:NoSuchBucket: The specified bucket does not exist (Service: Amazon S3; Status Code: 404; Error Code: NoSuchBucket; Request ID: 17D8C3C418303754; S3 Extended Request ID: 711af550-b344-4a8e-a564-c7362c2ff38b; Proxy: null)
The error seems to occur because the path s3a://bucket2/data/b.csv is looked up on MinIO1 instead of MinIO2 (the hive-site.xml in the Hive Metastore container points to MinIO1), even though I definitely used the minio2 catalog in Trino.
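One direction I am considering, but have not verified in this setup: Hadoop's S3A connector supports per-bucket overrides of the form fs.s3a.bucket.<bucket-name>.<option>. Assuming the exception above is raised by the metastore's own S3A client (the org.apache.hadoop.fs.s3a and com.amazonaws classes in the trace suggest this) rather than by Trino, a minimal sketch of the extra hive-site.xml entries for bucket2 might look like the following. ACCESS_KEY_2 and SECRET_KEY_2 are placeholders for MinIO2's credentials.

```xml
<!-- Sketch only: per-bucket S3A overrides, to be placed inside the existing
     <configuration> element of the metastore's hive-site.xml. -->
<!-- Requests for s3a://bucket2/... would use these values; all other buckets
     keep the global fs.s3a.* settings that point to MinIO1. -->
<property>
  <name>fs.s3a.bucket.bucket2.endpoint</name>
  <value>http://minio-2:9000</value>
</property>
<property>
  <name>fs.s3a.bucket.bucket2.access.key</name>
  <value>ACCESS_KEY_2</value>
</property>
<property>
  <name>fs.s3a.bucket.bucket2.secret.key</name>
  <value>SECRET_KEY_2</value>
</property>
<property>
  <name>fs.s3a.bucket.bucket2.path.style.access</name>
  <value>true</value>
</property>
```

This would only work while bucket names are unique across the two MinIO instances, since the override is keyed on the bucket name and not on the Trino catalog.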
Question:
Why is this happening, and how can I correctly configure Trino and Hive Metastore to access and join data from multiple MinIO endpoints?
Thank you in advance for your help!