Our team is upgrading a system that performs statistical processing with Hadoop and Hive, moving from Hadoop 0.20 and Hive 0.1.7 to Hadoop 3.3.3 and Hive 3.1.3. During testing we have run into several issues.
For example, when we JOIN two tables in Hive, the number of output records is significantly lower than what the same operation produced on the previous version of Hive. Each of the two tables contains approximately 70 million records. We are unsure whether the cause lies in the Hive configuration or in the data itself.
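One check that might help separate a data problem from a configuration problem is to compute the row count the JOIN should produce from per-key counts on each side, then compare that with the output of each Hive version. Below is a minimal Python sketch on toy data, assuming the JOIN is an inner equi-join on a single key. The function name `expected_inner_join_rows` and the sample keys are hypothetical; on the real tables, the per-key counts would have to come from something like a GROUP BY on the join key.

```python
from collections import Counter


def expected_inner_join_rows(left_keys, right_keys):
    """Return the row count an inner equi-join on one key should produce."""
    # NULL (None) keys never match in an inner equi-join, so drop them first.
    left = Counter(k for k in left_keys if k is not None)
    right = Counter(k for k in right_keys if k is not None)
    # Each key contributes (occurrences on the left) * (occurrences on the right).
    return sum(n * right[k] for k, n in left.items())


# Toy data (hypothetical); real keys would come from the two Hive tables.
left_keys = ["a", "a", "b", None, "c"]
right_keys = ["a", "b", "b", None, "d"]
print(expected_inner_join_rows(left_keys, right_keys))  # prints 4
```

If the expected count agrees with the old output but not the new one, the data itself is less likely to be the cause, and key handling (types, NULLs, whitespace) or configuration in the new version would be the next thing to look at.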
We want to troubleshoot these issues, but our team lacks the necessary expertise and does not know which procedures or tools to use for root cause analysis. We would greatly appreciate any insights into general troubleshooting methods for Hive 3, as well as commonly used diagnostic tools.
We would be grateful for any advice from those with experience in this area.
[Development Environment]
- OS: Rocky Linux 9.0
- Hadoop
  - Version: 3.3.3
  - NameNode: 3 nodes (each node: 24 logical cores, 32 GB memory)
  - DataNode: 126 nodes (each node: 48 logical cores, 128 GB memory)
- Hive
  - Version: 3.1.3
  - Execution engine: MR (Tez not used)
Please excuse any errors in my English as it is not my strong suit. Thank you for your assistance.