My use case is that I have a pipeline of independent, stand-alone programs that I want to execute in a certain order on specific pieces of data that are output from previous pipeline stages.
The pipeline is entirely linear and doesn’t do anything in terms of alternate paths through the pipe.
I’m currently using SGE to do this and it works OK, but occasionally a job will overstep its memory bounds and fail, and all jobs that require that output data will fail too. The pipeline needs to be restarted in that case, and it seems that the fault tolerance Akka provides might solve that for me?
There are many ways to provide fault tolerance for your service. It heavily depends on your use case, the tech you already have, and how pervasive your stack’s ecosystem is. It’s up to you to choose what to use and how to combine it in your code. Akka is a huge framework with a lot of functionality, and if you are using it only to provide fault tolerance, it can be overkill, but again, it depends on your use case.
Retrying
The most naive approach is to wrap your job in some retry strategy directly in your code, for example something like Spring Retry.
That would detect when your job has failed and retry it.
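For example, a minimal sketch with Spring Retry’s RetryTemplate might look like the following (the stage1.sh script name and the attempt count are just placeholders for one of your pipeline programs and your own retry budget):

```java
import org.springframework.retry.RetryCallback;
import org.springframework.retry.policy.SimpleRetryPolicy;
import org.springframework.retry.support.RetryTemplate;

public class RetryStage {
    public static void main(String[] args) throws Exception {
        RetryTemplate template = new RetryTemplate();
        // give up after 3 attempts; tune this to your job's failure profile
        template.setRetryPolicy(new SimpleRetryPolicy(3));

        template.execute((RetryCallback<Void, Exception>) context -> {
            // launch the stand-alone stage as an external process
            Process p = new ProcessBuilder("bash", "stage1.sh").inheritIO().start();
            if (p.waitFor() != 0) {
                // non-zero exit code = failure; throwing triggers another attempt
                throw new IllegalStateException("stage1.sh exited non-zero");
            }
            return null;
        });
    }
}
```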
Resource Managers
Another, less naive approach would be to use a resource manager like YARN or Mesos, or a scheduling service like AWS Batch (considering you are using SGE, probably not a good fit). All of these might be overkill for your use case, but they have very good capabilities in high-availability environments, so they are worth considering if you are chasing scale.
Monitor from outside
Another approach could be for each sub-job to report a healthcheck to an instance (let’s call it a Monitor) whose sole purpose is to monitor the state of the pipeline. If a job fails some healthchecks, the Monitor is responsible for restarting it, or restarting the whole pipe.
This seems to be the simplest and most logical approach in your use case, especially if you combine it with each job keeping state (or your Monitor keeping it for them), so that if one of them fails, you can restart the pipeline from the last successfully completed job.
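A minimal Monitor sketch, assuming Java 11+ and hypothetical stage scripts and a state file (stage1.sh, stage2.sh, stage3.sh, pipeline.state are all placeholders), could record the last successful stage and resume from there on the next run:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Runs the stages in order, records the last successful stage in a state file,
// and resumes from the stage after it when restarted following a failure.
public class PipelineMonitor {
    private static final Path STATE = Paths.get("pipeline.state");
    private static final List<String> STAGES = List.of("stage1.sh", "stage2.sh", "stage3.sh");

    public static void main(String[] args) throws IOException, InterruptedException {
        int start = lastCompleted() + 1;            // resume after the last recorded success
        for (int i = start; i < STAGES.size(); i++) {
            Process p = new ProcessBuilder("bash", STAGES.get(i)).inheritIO().start();
            if (p.waitFor() != 0) {
                // leave the state file untouched so the next run retries this stage
                throw new IllegalStateException(STAGES.get(i) + " failed; rerun to resume");
            }
            Files.writeString(STATE, Integer.toString(i));  // persist progress
        }
        Files.deleteIfExists(STATE);                // pipeline finished, clear the state
    }

    private static int lastCompleted() throws IOException {
        return Files.exists(STATE) ? Integer.parseInt(Files.readString(STATE).trim()) : -1;
    }
}
```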
If you look at a typical pipeline built on Hadoop over YARN, a common approach is for each job to take its input from a path in HDFS (where you keep your state) and write its output to another path in HDFS, so other jobs can take it as their input. YARN, the resource manager, takes care of job state, progress, failures, etc., and allocates resources to each job. This sounds similar to your SGE case; it’s just that you would probably have to implement some (or a lot) of the monitor + state logic yourself. If you don’t care about state, and are fine restarting the pipeline from scratch, it’s much easier, as you have only two very simple moving parts (see the sketch after this list):
- detect when a job has failed
- restart the whole pipeline
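In that stateless variant the Monitor reduces to a retry loop around the whole pipeline. Again just a sketch; the stage names and the attempt limit are placeholders you would replace with your own:

```java
import java.util.List;

public class RestartWholePipeline {
    private static final List<String> STAGES = List.of("stage1.sh", "stage2.sh", "stage3.sh");
    private static final int MAX_ATTEMPTS = 3;

    public static void main(String[] args) throws Exception {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            if (runAll()) {
                return;                             // every stage succeeded
            }
            System.err.println("Pipeline failed on attempt " + attempt + ", restarting from scratch");
        }
        throw new IllegalStateException("Pipeline failed after " + MAX_ATTEMPTS + " attempts");
    }

    // Run the stages in order; report failure as soon as one exits non-zero.
    private static boolean runAll() throws Exception {
        for (String stage : STAGES) {
            Process p = new ProcessBuilder("bash", stage).inheritIO().start();
            if (p.waitFor() != 0) {
                return false;
            }
        }
        return true;
    }
}
```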