Re: checkpoint failure in forever loop suddenly even state size less than 1 mb


Fabian Hueske-2
Hi Sushant,

It's hard to tell what's going on.
Maybe the thread pool of the async I/O operator is too small for the ingested data rate?
That could cause backpressure on the source and, eventually, the failing checkpoints.
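
To make the thread-pool point concrete, here is a rough sizing sketch in plain Java. It does not use the Flink API; it only applies Little's law (requests in flight = arrival rate x time per request). The 5000 records/sec figure is the load quoted in this thread. The 50 ms call latency, the pool size of 10, and the class and method names are assumptions for illustration only, not values taken from the job.

```java
// Back-of-the-envelope check: can a client pool keep up with the ingest rate?
// All names here are hypothetical; this is not Flink code.
class AsyncCapacitySketch {

    /** Little's law: concurrent requests needed = rate * latency (rounded up). */
    public static long requiredInFlight(long recordsPerSecond, long latencyMillis) {
        return (recordsPerSecond * latencyMillis + 999) / 1000;
    }

    /** Throughput ceiling when each concurrent slot handles one blocking call at a time. */
    public static long maxThroughput(int concurrentRequests, long latencyMillis) {
        return concurrentRequests * 1000L / latencyMillis;
    }

    public static void main(String[] args) {
        long rate = 5000;       // records/sec, as mentioned in this thread
        long latencyMs = 50;    // ASSUMPTION: latency of one external call
        int poolSize = 10;      // ASSUMPTION: threads available to the async client

        long needed = requiredInFlight(rate, latencyMs);
        long ceiling = maxThroughput(poolSize, latencyMs);

        System.out.println("in-flight requests needed: " + needed);
        System.out.println("throughput ceiling with pool: " + ceiling + " records/sec");
        if (ceiling < rate) {
            System.out.println("pool is too small: expect backpressure on the source");
        }
    }
}
```

With these assumed numbers, about 250 requests would have to be in flight to sustain the load, while 10 blocking threads top out near 200 records/sec, so records would queue up in front of the operator. The real latency and pool size of the job would need to be plugged in before drawing any conclusion.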

Which Flink version are you using?

Best, Fabian

On Thu, 29 Aug 2019 at 12:07, Sushant Sawant <[hidden email]> wrote:
Hi Fabian,
Sorry for the one-to-one mail.
Could you help me out with this? I've been stuck on this issue for over a week now.

Thanks & Regards,
Sushant Sawant

On Tue, 27 Aug 2019, 15:23 Sushant Sawant, <[hidden email]> wrote:

Hi, firstly thanks for replying.

Here is the checkpoint-related configuration:

CheckpointingMode checkpointMode = CheckpointingMode.valueOf("AT_LEAST_ONCE");
Long checkpointInterval = Long.valueOf(parameterMap.get(Checkpoint.CHECKPOINT_INTERVAL.getKey()));
StateBackend sb = new FsStateBackend("file:////");
env.enableCheckpointing(300000, checkpointMode);

Thanks & Regards,
Sushant Sawant

On Tue, 27 Aug 2019, 14:09 [hidden email], <[hidden email]> wrote:
Hi, what's your checkpoint config?

Date: 2019-08-27 15:31
Subject: Re: checkpoint failure suddenly even state size less than 1 mb
Hi team,
Any help or suggestions? We have now stopped all input to Kafka, so there is no processing and nothing going to the sinks, but checkpointing is still failing.
Is it the case that once a checkpoint fails, it keeps failing forever until the job is restarted?

Help appreciated.

Thanks & Regards,
Sushant Sawant

On 23 Aug 2019 12:56 p.m., "Sushant Sawant" <[hidden email]> wrote:
Hi all,
I'm facing two issues, which I believe are related:
1. The Kafka source shows high back pressure.
2. Checkpoints suddenly fail, and keep failing for the entire day until a restart.

My job does the following:
a. Reads from Kafka
b. Makes async I/O calls to an external system
c. Writes to Cassandra and Elasticsearch

Checkpointing uses the file system.
This Flink job has been proven under high load,
around 5000/sec throughput.
But we recently scaled down the parallelism, since there wasn't any load in production, and these issues started.

Please find the status shown by the Flink dashboard.
The GitHub folder contains images from when there was high back pressure and checkpoint failure
and, in the same folder, "everything is fine" images taken after the restart.

Could anyone point me in the right direction on what might have gone wrong, or how to troubleshoot this?

Thanks & Regards,
Sushant Sawant