Approach to Auto Scaling Flink Job

Approach to Auto Scaling Flink Job

Anil
I'm using Uber's open-source project AthenaX. As mentioned in its docs[1], it
supports `Auto scaling for AthenaX jobs`. I went through the source code on
GitHub but didn't find the auto-scaling part. Could someone familiar with this
project please point me in the right direction?

I'm using Flink's Table API (Flink 1.4.2) and submit my jobs programmatically
to the Yarn cluster. All the JM and TM metrics are saved in Prometheus. I am
thinking of using these metrics to develop an algorithm to rescale jobs. I
would also appreciate it if someone could share how they developed their
auto-scaling part.

[1] https://athenax.readthedocs.io/en/latest/
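
A metric-driven rescaling rule of the kind described above could start from something like the following. This is only a minimal sketch under stated assumptions, not AthenaX's algorithm: `ScalingPolicy`, `desired_parallelism`, and all threshold values are hypothetical, and the two inputs (source input rate and the sustainable rate of one subtask) are assumed to be derived from the metrics already stored in Prometheus.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    # All values are illustrative assumptions, not Flink or AthenaX defaults.
    target_utilization: float = 0.7  # aim to run each subtask at 70% of capacity
    min_parallelism: int = 1
    max_parallelism: int = 64
    tolerance: float = 0.2           # ignore changes of 20% or less to avoid flapping


def desired_parallelism(current, input_rate, per_subtask_capacity,
                        policy=ScalingPolicy()):
    """Return the parallelism the job should run with.

    current: parallelism the job runs with now
    input_rate: records/second arriving at the job (assumed to come from metrics)
    per_subtask_capacity: records/second one subtask can sustain (assumed measured)
    """
    if per_subtask_capacity <= 0:
        raise ValueError("per_subtask_capacity must be positive")
    # Subtasks needed so that each one stays at or below the target utilization.
    needed = math.ceil(input_rate / (per_subtask_capacity * policy.target_utilization))
    # Clamp to the allowed range.
    needed = max(policy.min_parallelism, min(policy.max_parallelism, needed))
    # Hysteresis: keep the current parallelism when the change is small.
    if abs(needed - current) <= policy.tolerance * current:
        return current
    return needed
```

The tolerance band is there because every rescale means restarting the job; applying the result would typically mean stopping the job with a savepoint and resubmitting it with the new parallelism.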




--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Approach to Auto Scaling Flink Job

Rong Rong
Hi Anil,

Thanks for reporting the issue. I went through the code and I believe the auto-scaling functionality is still in our internal branch and has not been merged to the open-source branch yet. 
I will change the documentation accordingly. 

Thanks,
Rong

In reply to Anil's message of Mon, May 6, 2019, 9:54 PM.

Re: Approach to Auto Scaling Flink Job

Anil
Thanks for the reply, Rong. Could you share the design of the auto-scaling
part, if possible, or point me in the right direction so that I can build
this feature myself?

 




Re: Approach to Auto Scaling Flink Job

Rong Rong
Hi Anil,

We gave a presentation[1] at FlinkForward 2018 that briefly discusses the high-level approach (via a watchdog).

We are also restructuring our open-source AthenaX:
Our internal implementation has diverged from the open-source version for too long, which has made it difficult for us to merge back into the open-source upstream. So we are likely to create a new, modularized version of AthenaX in the future.

Thanks for your interest, and please stay tuned for our next release.

Best,
Rong


In reply to Anil's message of Wed, May 8, 2019, 11:32 AM.

Re: Approach to Auto Scaling Flink Job

Anil
Thanks, Rong. The FlinkForward talk was insightful.
One more question: the talk mentions that the jobs run on Yarn and are
monitored by containers running on Docker. Can you explain why Docker is
needed here? When we deploy a job to Yarn, one Yarn container is already
dedicated to the Job Manager, which monitors the job. What additional
functionality does Docker provide?
Also, when jobs are deployed on Yarn, the master node becomes a single
point of failure. Are you using a multi-master setup, or have you taken
another approach to handle failover?
Regards,
Anil.





Re: Approach to Auto Scaling Flink Job

Rong Rong
Hi Anil,

We use Docker because internally we support Dockerized containers for microservices.

Ideally, this can be any external service running on something other than the YARN cluster your Flink application resides on. Basically, the watchdog runs outside of the Flink cluster: it is designed to capture failures that are not self-recoverable by YARN/Flink alone, for example a schema evolution in a source/sink, or corrupted data that needs to be skipped. Because of this, it does not make sense to run it on the same YARN cluster.
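
To make the idea of an external watchdog concrete, here is a minimal sketch of the decision step such a service might run on each polling cycle. It is not the internal watchdog discussed here: `plan_actions`, the action names, and the retry limit are hypothetical, and the input is assumed to be a list of job summaries (name and state) that the watchdog has already fetched from outside the cluster. The state strings follow Flink's `JobStatus` names; how they map to actions is an assumption.

```python
# Job states named after Flink's JobStatus enum; the grouping is an assumption.
HEALTHY = {"CREATED", "RUNNING"}
TRANSIENT = {"RESTARTING", "FAILING", "CANCELLING", "RECONCILING"}
RESUBMITTABLE = {"FAILED"}


def plan_actions(jobs, resubmits_so_far, max_resubmits=3):
    """Decide what the watchdog should do for each job.

    jobs: list of dicts with "name" and "state" keys (assumed shape)
    resubmits_so_far: dict mapping job name -> resubmissions already attempted
    Returns a dict mapping job name -> "ok", "wait", "resubmit", or "escalate".
    """
    actions = {}
    for job in jobs:
        name, state = job["name"], job["state"]
        if state in HEALTHY:
            actions[name] = "ok"
        elif state in TRANSIENT:
            # Flink/YARN may still recover on their own; check again next cycle.
            actions[name] = "wait"
        elif state in RESUBMITTABLE and resubmits_so_far.get(name, 0) < max_resubmits:
            actions[name] = "resubmit"
        else:
            # Repeated failures or unexpected states need a human, e.g. after
            # a schema change in a source or sink.
            actions[name] = "escalate"
    return actions
```

Keeping the decision logic as a pure function makes it easy to test separately from whatever polls the cluster and resubmits jobs.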

We have enabled HA in Flink's JM now but not at the time of the presentation. 
I CCed Peter who might be able to answer this question better.

Thanks,
Rong


In reply to Anil's message of Sat, May 11, 2019, 10:12 PM.

Re: Approach to Auto Scaling Flink Job

Anil
Thanks for the clarification, Rong!
As I understand it, the Docker containers monitor the Flink jobs running in
the Yarn cluster. The Flink JMs have HA enabled, so there's a standby JM in
case the JM fails, and in case of a TM failure, that TM will be re-deployed.
All good. My concern is what happens if the Yarn master node goes down. Is
the Yarn cluster running with multiple masters, or in case of failure do you
migrate your jobs to a different cluster? If so, is this failover to a
different cluster built into AthenaX?
Regards,
Anil.




Re: Approach to Auto Scaling Flink Job

Rong Rong
Hi Anil,

A typical Yarn Resource Manager setup consists of 2 RM nodes [1] in an active/standby configuration.
FYI: We've also shared some practical experience with the limitations of this setup, and potential redundant fail-safe mechanisms, in our latest talk[2] at this year's FlinkForward.
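
As a small illustration of the active/standby arrangement from a client's point of view, the sketch below picks the active RM out of a set of already-fetched status responses. It assumes each response is the parsed JSON of the Resource Manager's cluster information REST endpoint (`/ws/v1/cluster/info`), which reports an `haState` field; the function name and the addresses in the usage example are hypothetical.

```python
def active_resource_manager(cluster_infos):
    """Return the address of the single active RM, or None if there is not exactly one.

    cluster_infos: dict mapping RM address -> parsed cluster-info JSON,
                   or None when that RM could not be reached.
    """
    active = [
        addr
        for addr, info in cluster_infos.items()
        if info and info.get("clusterInfo", {}).get("haState") == "ACTIVE"
    ]
    # Zero active RMs (failover in progress or both down) and more than one
    # (inconsistent view) are both treated as "no usable RM right now".
    return active[0] if len(active) == 1 else None


# Usage example with made-up addresses:
infos = {
    "rm1.example:8088": {"clusterInfo": {"haState": "STANDBY"}},
    "rm2.example:8088": {"clusterInfo": {"haState": "ACTIVE"}},
}
print(active_resource_manager(infos))
```

A caller that gets `None` would normally retry after a short delay instead of failing immediately, since a failover between the two RMs takes some time.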

Thanks,
Rong


In reply to Anil's message of Thu, May 16, 2019, 5:08 AM.