I am a data engineer.
I have used PySpark for a long time and am now moving to Apache Beam/Dataflow.
Since Dataflow is a managed service, we don't have to do much tuning ourselves.
But there is one thing I want to know:
how can we fix data skew when using aggregation transforms like GroupByKey and CoGroupByKey?
In Spark, we can use salting or enable AQE, but how do we achieve this in Apache Beam?
I first thought of using Reshuffle, but that runs into the same problem.
All the values for a given key still have to end up on the same worker, which may cause an out-of-memory error for a hot key.
Is there any way to handle this?
If yes, can anyone give me the steps?
Thanks.
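Yes, handling skew in Apache Beam is achievable. Two techniques help: salting the key (appending a small random suffix so a hot key is spread over several workers, then aggregating again on the original key), and using a combiner such as CombinePerKey instead of GroupByKey, so values are partially aggregated before the shuffle rather than collected in full on one worker.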
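To make the salting idea concrete, here is a minimal sketch in plain Python (not Beam code) of a two-stage salted sum. The names `salted_sum` and `NUM_SALTS` are made up for illustration, and the fanout of 4 is an arbitrary assumption. In the Beam Python SDK, `beam.CombinePerKey(sum).with_hot_key_fanout(n)` applies a similar pre-aggregation for you. Note that this only works when the aggregation is associative and commutative (sum, count, max, and so on).

```python
import random
from collections import defaultdict

NUM_SALTS = 4  # hypothetical fanout: how many shards a hot key is split into


def salted_sum(records, num_salts=NUM_SALTS, seed=0):
    """Sum values per key in two stages, so no single group holds a whole hot key."""
    rng = random.Random(seed)

    # Stage 1: aggregate on (key, salt). In a distributed runner each
    # (key, salt) group could be processed by a different worker.
    partial = defaultdict(int)
    for key, value in records:
        salt = rng.randrange(num_salts)
        partial[(key, salt)] += value

    # Stage 2: drop the salt and combine the (at most num_salts) partial
    # sums per original key. This stage sees very little data per key.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal

    return dict(totals), dict(partial)


if __name__ == "__main__":
    # One heavily skewed key and one small key.
    records = [("hot", 1)] * 1000 + [("cold", 5)]
    totals, partial = salted_sum(records)
    print(totals)
    print(len(partial), "partial groups")
```

For a skewed CoGroupByKey join the same trick needs one extra step: salt the large side randomly and replicate each record of the small side once per salt value, so every salted key still finds its match.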