I am learning about PySpark and RDDs. I was doing an exercise that involves broadcasting the contents of a file that lives on the gateway node, and I have a few questions.
- Is it possible to broadcast an RDD? I am assuming it is not, but I am not sure. My thinking is that it would involve gathering all the partitions together and then placing a copy of the combined data on each worker node (the first sketch below shows the pattern I have in mind).
- Let's say I have a 100-node cluster and a data file of size 1 GB. With the default block size of 128 MB, this gives 8 partitions spread across 8 worker nodes. I also have another file of size 2 MB. When I broadcast the contents of the smaller file, will a copy end up on all 100 worker nodes, or only on the 8 nodes that hold partitions of the larger file? (The second sketch below shows roughly what I am doing.)
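To make the first question concrete, here is a minimal sketch of the pattern I have in mind. The file names `lookup.txt` and `big_data.txt` are just placeholders, and I am assuming the lookup file holds comma-separated key,value lines:

```python
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-rdd-question")

# Small lookup data, first loaded as an RDD; lookup.txt is assumed to
# contain comma-separated "key,value" lines.
lookup_rdd = sc.textFile("lookup.txt").map(lambda line: tuple(line.split(",", 1)))

# I do not think sc.broadcast(lookup_rdd) itself would work, so instead I
# collect the RDD back to the driver and broadcast the resulting plain dict.
lookup_bc = sc.broadcast(dict(lookup_rdd.collect()))

# Use the broadcast dict inside a transformation on a larger RDD.
big_rdd = sc.textFile("big_data.txt")
joined = big_rdd.map(lambda key: (key, lookup_bc.value.get(key)))
print(joined.take(5))
```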
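And for the second question, this is roughly what I am doing with the 2 MB file (this reuses the `sc` from the sketch above; `small_file.txt` and the HDFS path are made-up names, and my question is only about where copies of `small_bc` end up):

```python
# The ~2 MB file lives on the gateway/driver node, so I read it with plain
# Python and broadcast its contents rather than turning it into an RDD.
with open("small_file.txt") as f:
    small_set = set(f.read().splitlines())

small_bc = sc.broadcast(small_set)

# The 1 GB file on HDFS becomes ~8 partitions with the default 128 MB block size.
large_rdd = sc.textFile("hdfs:///data/large_file.txt")

# Does every one of the 100 workers get a copy of small_bc, or only the 8
# that actually process partitions of large_rdd?
matches = large_rdd.filter(lambda line: line in small_bc.value)
print(matches.count())
```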
Thank you.