NoClassDefFoundError in failing-restarting job that uses url classloader

Subramanyam Ramanathan

 

Hello,

 

I'm currently using Flink 1.7.2.

 

I'm trying to run a job that's submitted programmatically using the ClusterClient API.

               public JobSubmissionResult run(PackagedProgram prog, int parallelism)

 

 

The job makes use of some jars, which I add to the packaged program through the PackagedProgram constructor, along with the job's jar file.

   public PackagedProgram(File jarFile, List<URL> classpaths, String... args)

Normally, this works perfectly and the job runs fine.

 

However, if there's an error in the job, the job goes into the failing state and Flink continuously tries to restart it. After an hour or so of restarts, I notice a NoClassDefFoundError for some classes in the jars that I load using the URL classloader, and the job never recovers after that, even once the root cause of the issue has been fixed. (I had a Kafka source/sink in my job; Kafka was down temporarily and was brought back up afterwards.)

The jar is still available at the path referenced by the URL classloader and has not been tampered with.

 

Could anyone please give me some pointers on why this could happen, what I could be missing here, or how I can debug further?

 

thanks

Subbu

 

 

Re: NoClassDefFoundError in failing-restarting job that uses url classloader

朱翥
Hi Subramanyam,

Could you share more information, including:
1. the URL pattern
2. the detailed exception and the log around it
3. the cluster the job is running on, e.g. standalone, YARN, k8s
4. whether it's session mode or per-job mode

This information would be helpful to identify the failure cause.

Thanks,
Zhu Zhu











In reply to Subramanyam Ramanathan <[hidden email]>, who wrote on Friday, August 9, 2019 at 1:45 AM.

 

RE: NoClassDefFoundError in failing-restarting job that uses url classloader

Subramanyam Ramanathan

Hi.

 

1)      URL pattern example: file:///root/flink-test/lib/dependency.jar

2)      I'm trying to reproduce the issue on a separate Flink installation with a sample job so that I can share the logs. So far I've been unable to reproduce it there, though on our product setup it occurs quite frequently.

3)      The job is running in standalone mode. We have separate k8s pods, built from our own images, which run the taskmanager and jobmanager for our product. A third pod connects using k8s and submits the job.

4)      Per-job mode
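
A file:// URL is resolved against the local filesystem of whichever JVM opens it, so with separate pods it may be worth confirming that the path is present and readable inside every pod whose classloader could open it. The following is only a minimal sketch of such a check using the JDK alone; the class name ClasspathUrlCheck and the method describe are hypothetical and are not part of Flink.

```java
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ClasspathUrlCheck {

    /** Classifies a classpath URL as seen from the JVM that runs this code. */
    static String describe(String url) {
        try {
            URI uri = URI.create(url);
            if (!"file".equalsIgnoreCase(uri.getScheme())) {
                return "NOT_A_FILE_URL";
            }
            // Map the file URL to a path on this machine's filesystem.
            Path path = Paths.get(uri);
            if (!Files.exists(path)) {
                return "MISSING";
            }
            return Files.isReadable(path) ? "READABLE" : "NOT_READABLE";
        } catch (RuntimeException e) {
            // Malformed URL, or one that cannot be mapped to a local path.
            return "INVALID: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Pass the same URLs that are handed to PackagedProgram as classpaths.
        for (String url : args) {
            System.out.println(url + " -> " + describe(url));
        }
    }
}
```

Running it inside each pod with the URL from point 1 as the argument shows whether that pod can see the jar at all. It does not say anything about whether the classes inside the jar load successfully.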

 

I'm trying to reproduce the issue on a separate Flink installation outside of our product environment. I'll update as soon as I have results.

 

Thanks,

Subbu

 

In reply to Zhu Zhu <[hidden email]>, who wrote on Friday, August 9, 2019 at 7:43 AM.

 

Re: NoClassDefFoundError in failing-restarting job that uses url classloader

朱翥
Hi Subramanyam,

I think standalone per-job mode does not invoke PackagedProgram(File jarFile, List<URL> classpaths, String... args) to generate the PackagedProgram, and thus does not add the extra classpaths to the job.

Regarding the NoClassDefFoundError, another possibility is that the class file exists but its static initialization fails. That also prevents the class from being loaded and causes a NoClassDefFoundError.
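
The static-initialization case can be reproduced with the JDK alone. In the minimal sketch below, FragileDependency is a hypothetical stand-in for a class from the dependency jar, and the thrown exception simulates static setup that fails while an external service is down. The first use of the class raises ExceptionInInitializerError; every later use through the same classloader raises NoClassDefFoundError, even if the original problem has gone away in the meantime.

```java
public class StaticInitDemo {

    /** Hypothetical stand-in for a dependency class whose static setup fails. */
    static class FragileDependency {
        // Not a compile-time constant, so the first use of the class runs loadConfig().
        static final String CONFIG = loadConfig();

        private static String loadConfig() {
            // Simulates static setup that depends on something temporarily unavailable.
            throw new IllegalStateException("simulated failure in static initializer");
        }

        static String config() {
            return CONFIG;
        }
    }

    /** Touches the class and reports which error type, if any, the JVM raised. */
    static String attempt() {
        try {
            return "loaded: " + FragileDependency.config();
        } catch (LinkageError e) {
            // Both ExceptionInInitializerError and NoClassDefFoundError are LinkageErrors.
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println("first use:  " + attempt()); // ExceptionInInitializerError
        System.out.println("second use: " + attempt()); // NoClassDefFoundError
    }
}
```

If this is what happens in the job, the logs should contain an earlier ExceptionInInitializerError for the same class, possibly long before the first NoClassDefFoundError, and its cause would point at the real failure. Whether the restarting job keeps reusing the same classloader in this setup is an assumption that would need to be checked.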

Thanks,
Zhu Zhu


In reply to Subramanyam Ramanathan <[hidden email]>, who wrote on Saturday, August 10, 2019 at 2:38 PM.
