This is the first time I am working with multiple topics, and I need help designing a scalable, high-performance data pipeline on a Kafka cluster with 3 brokers.
- It is a batch-processing use case: Kafka is used purely for data integration between source and sink, and no transformation of the data is involved.
- We collect data from 1000+ source systems every hour and push it to Kafka. Can the number of topics equal the number of systems, i.e. one topic per system? (A rough topic-creation sketch is at the end of this post.)
- What parameters should be considered when deciding the number of partitions? (The sizing arithmetic I have been using is also below.)
- How should I decide on the consumer groups and the number of consumers per group? (A consumer sketch is below.)
- The sinks are Redshift and S3. Is there a Kafka integration API for AWS? (The Kafka Connect configuration I have been looking at is below.)
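
To make the questions concrete, here are the rough sketches I have so far.

For the topic-per-system question, this is what I had in mind for creating the topics programmatically. It is only a sketch, assuming the `confluent_kafka` Python client; the broker address, system IDs, topic naming scheme, and partition/replication counts are placeholders.

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Placeholder broker address and system IDs -- in reality there are 1000+ systems.
admin = AdminClient({"bootstrap.servers": "broker1:9092"})
system_ids = ["system-0001", "system-0002", "system-0003"]

# One topic per source system, replicated across the 3 brokers.
new_topics = [
    NewTopic(f"ingest.{sid}", num_partitions=3, replication_factor=3)
    for sid in system_ids
]

# create_topics() returns a dict of topic name -> future.
for topic, future in admin.create_topics(new_topics).items():
    try:
        future.result()
        print(f"created {topic}")
    except Exception as exc:
        print(f"failed to create {topic}: {exc}")
```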
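For the partition count, this is the sizing arithmetic I have been using as a starting point; the throughput figures are placeholders and would come from our own producer and consumer benchmarks.

```python
import math

# Placeholder numbers -- replace with measured values from benchmarks.
target_throughput_mb_s = 50       # peak ingest rate across all hourly batches
produce_per_partition_mb_s = 10   # measured producer throughput per partition
consume_per_partition_mb_s = 20   # measured consumer throughput per partition

# Rule of thumb: enough partitions to satisfy both the produce side
# and the consume side at the target throughput.
partitions = max(
    math.ceil(target_throughput_mb_s / produce_per_partition_mb_s),
    math.ceil(target_throughput_mb_s / consume_per_partition_mb_s),
)
print(partitions)  # -> 5 with the placeholder numbers above
```

Is this the right way to think about it, given the total number of partitions that 1000+ topics would put on only 3 brokers?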
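For reference, this is roughly how I planned to run the consumers. My understanding is that every consumer sharing the same `group.id` splits the partitions among themselves, so the number of consumers per group should not exceed the number of partitions. The group name and topic pattern are placeholders.

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker1:9092",
    "group.id": "s3-redshift-loader",   # consumers with this group.id share the partitions
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})

# Regex subscription so one group can read all per-system topics.
consumer.subscribe(["^ingest\\..*"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(msg.error())
            continue
        # ... accumulate the hourly batch here and hand it to the sink loader ...
        consumer.commit(asynchronous=False)
finally:
    consumer.close()
```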
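For the AWS side, the closest thing I have found to an "integration API" is Kafka Connect. This is the kind of S3 sink configuration I have been looking at; it assumes the Confluent S3 sink connector plugin is installed on a Connect cluster, and the Connect URL, bucket name, and region are placeholders. Is Connect the recommended route for S3 and Redshift, or is there something more native on AWS?

```python
import requests

# Hypothetical Connect REST endpoint; assumes the Confluent S3 sink connector
# plugin is installed on the Connect workers.
CONNECT_URL = "http://connect:8083/connectors"

connector = {
    "name": "s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics.regex": "ingest\\..*",
        "s3.bucket.name": "my-ingest-bucket",   # placeholder bucket
        "s3.region": "us-east-1",               # placeholder region
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "10000",
        "tasks.max": "6",
    },
}

resp = requests.post(CONNECT_URL, json=connector)
resp.raise_for_status()
print(resp.json())
```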