At my workplace, we are currently planning out our data analytics platform and have chosen to go with an Iceberg lakehouse architecture. However, we’ve encountered a few challenges at the BI consumption layer and are seeking advice on how others have addressed these issues.
Challenges:
1. Dynamic Row Filtering with AWS Lake Formation / Data Zone:
- We found that AWS Lake Formation and Data Zone do not support dynamic row filtering effectively. This limitation prevents us from implementing row-level access controls on Iceberg tables, meaning users cannot be restricted to see only the data they have access to.
- In contrast, this can be easily achieved in Redshift and Snowflake using row access policies.
Query Performance with Athena:
- We have observed that querying data using Athena is slow, even after optimizing with partitioning and other techniques. While we understand that Iceberg is primarily designed for big data analytics, the latency in query performance remains a significant concern for our BI consumption needs.
Proposed Solution:
Given these challenges, we are considering the following approach for our BI consumption layer:
Data Integration: Load our Iceberg tables into a traditional data warehouse (such as Redshift or Snowflake). This would enable:
- Faster query performance for BI tools.
- Enhanced data governance capabilities, including dynamic row filtering and access controls.
- Technical and Batch Analytics: Continue using Iceberg tables directly for technical data analysts and batch processing jobs, leveraging its strengths in big data analytics.
Questions:
- Has anyone else faced similar challenges with Iceberg, and how did you overcome them?
- Is our proposed solution of integrating Iceberg with a traditional data warehouse for the BI layer a common practice? Are there any best practices or potential pitfalls we should be aware of?
- Are there alternative approaches or technologies we should consider to achieve better performance and data governance with Iceberg tables?