long lived standalone job session cluster in kubernetes

Derek VerLee

I'm looking at the job cluster mode; it looks great, and I am considering migrating our jobs off our "legacy" session cluster and into Kubernetes.

I do need to ask some questions, because I haven't found a lot of details in the documentation about how it works yet, and I gave up following the DI around in the code after a while.

Let's say I have a deployment for the job "leader" in HA with ZK, and another deployment for the taskmanagers.

I want to upgrade the code or configuration and start from a savepoint, in an automated way.

Best I can figure, I cannot just update the deployment resources in Kubernetes and allow the containers to restart in an arbitrary order.

Instead, I expect sequencing is important, something along the lines of this:

1. issue savepoint command on leader
2. wait for savepoint
3. destroy all leader and taskmanager containers
4. deploy new leader, with savepoint url
5. deploy new taskmanagers


For example, I imagine old taskmanagers (with an old version of my job) attaching to the new leader and causing a problem.
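To make this concrete, here is a rough sketch of how I imagine driving those steps from a script. Everything here is assumed on my part: the jobmanager service address, namespace, deployment names and manifest file names are placeholders, and I'm using Flink's REST savepoint endpoints rather than the CLI.

# Hypothetical upgrade driver for the steps above. All names
# (service URL, namespace, deployments, manifests) are placeholders.
import subprocess
import time

import requests

FLINK = "http://flink-jobmanager:8081"   # assumed jobmanager REST address
NAMESPACE = "flink"                       # assumed namespace


def trigger_savepoint(job_id, target_dir):
    # Step 1: ask the jobmanager for a savepoint via the REST API
    resp = requests.post(
        f"{FLINK}/jobs/{job_id}/savepoints",
        json={"target-directory": target_dir, "cancel-job": False},
    )
    resp.raise_for_status()
    return resp.json()["request-id"]


def wait_for_savepoint(job_id, request_id):
    # Step 2: poll until the savepoint completes and return its location
    while True:
        info = requests.get(
            f"{FLINK}/jobs/{job_id}/savepoints/{request_id}"
        ).json()
        if info["status"]["id"] == "COMPLETED":
            return info["operation"]["location"]
        time.sleep(2)


def kubectl(*args):
    subprocess.run(["kubectl", "-n", NAMESPACE, *args], check=True)


def upgrade(job_id, savepoint_dir):
    request_id = trigger_savepoint(job_id, savepoint_dir)
    savepoint = wait_for_savepoint(job_id, request_id)
    # Step 3: tear everything down so no old taskmanager can reattach
    kubectl("delete", "deployment", "flink-jobmanager", "flink-taskmanager")
    # Steps 4 and 5: the new jobmanager manifest would pass the savepoint
    # path to the job cluster entry point; taskmanagers follow afterwards
    kubectl("apply", "-f", "jobmanager-deployment.yaml")
    kubectl("apply", "-f", "taskmanager-deployment.yaml")
    print("new cluster should resume from", savepoint)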

Does that sound right, or am I overthinking it?

If not, has anyone tried implementing any automation for this yet?

Re: long lived standalone job session cluster in kubernetes

Dawid Wysakowicz-2
Hi Derek,

I am not an expert in Kubernetes, so I will cc Till, who should be able to help you more.

As for automating a similar process, I would recommend having a look at dA Platform [1], which is built on top of Kubernetes.

Best,

Dawid

[1] https://data-artisans.com/platform-overview

Re: long lived standalone job session cluster in kubernetes

Andrey Zagrebin
Hi Derek,

I think your automation steps look good. Recreating the deployments should not take long, and as you mention, this way you can avoid unpredictable old/new version collisions.

Best,
Andrey

Re: long lived standalone job session cluster in kubernetes

Till Rohrmann
Hi Derek,

What I would recommend is to trigger the cancel-with-savepoint command [1]. This will create a savepoint and terminate the job execution. Next you simply respawn the job cluster, providing it with the savepoint to resume from.
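As a sketch of how that could look when automated: the request below takes the savepoint and cancels the job in one step. The service address, savepoint directory and job id are placeholders, and the exact endpoint shapes should be checked against your Flink version. I believe the CLI form is flink cancel -s <targetDirectory> <jobID>.

# Hypothetical sketch: cancel-with-savepoint through the jobmanager's
# REST API. The URL, job id and savepoint directory are placeholders.
import requests

resp = requests.post(
    "http://flink-jobmanager:8081/jobs/<job-id>/savepoints",
    json={"target-directory": "s3://my-bucket/savepoints", "cancel-job": True},
)
resp.raise_for_status()
# Poll /jobs/<job-id>/savepoints/<request-id> until COMPLETED, then pass
# the returned savepoint location to the new job cluster when it starts.
print(resp.json()["request-id"])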


Cheers,
Till

Re: long lived standalone job session cluster in kubernetes

Derek VerLee

Sounds good.

Is someone working on this automation today?

If not, although my time is tight, I may be able to work on a PR to get us started down the path of Kubernetes-native cluster mode.


Re: long lived standalone job session cluster in kubernetes

Till Rohrmann
Hi Derek,

There is this issue [1], which tracks the active Kubernetes integration. Jin Sun has already started implementing some parts of it, and there should also be some PRs open for it; please check them out.


Cheers,
Till
