Tuesday, June 2, 2015

ClassNotFound inside a Task on Spark >= 1.3.0

Context - Spark 1.3.0, Custom InputFormat and InputSplit.

Problem - At my work, I had a custom InputSplit definition that held an object of another class A, which I then needed to pass on to my Key. I have a Spark job that reads data using my custom InputFormat, and everything was fine on Spark 1.2.0. When we upgraded to 1.3.0, things started breaking with the following stack trace.

Caused by: java.lang.ClassNotFoundException: x.y.z.A
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:270)
 at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:625)
 at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
 at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at org.apache.spark.util.Utils$.deserialize(Utils.scala:80)
 at org.apache.spark.util.Utils.deserialize(Utils.scala)
 at x.y.z.CustomRecordSplit.readFields(CustomRecordSplit.java:91)

Solution - It took me a while to realize that I'd been using Spark's Utils object to serialize and deserialize the object (x.y.z.A). The fix was very simple:

objA = Utils.deserialize(buffer, Utils.getContextOrSparkClassLoader());

It looks like in the earlier versions the __app__.jar was added to the Executor and Task classloaders, but that no longer happens in the later versions. When I passed the context ClassLoader to the deserialization, it worked perfectly fine.
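For reference, here is roughly what the fixed readFields ended up looking like (a minimal sketch, not the actual production code; the length-prefixed framing of the serialized bytes is an assumption):

 import java.io.DataInput;
 import java.io.IOException;
 import org.apache.spark.util.Utils;

 // Inside x.y.z.CustomRecordSplit (sketch)
 @Override
 public void readFields(DataInput in) throws IOException {
     int length = in.readInt();           // length of the serialized A object (framing is an assumption)
     byte[] buffer = new byte[length];
     in.readFully(buffer);
     // Passing the context (or Spark) classloader lets classes from __app__.jar resolve on executors
     objA = Utils.deserialize(buffer, Utils.getContextOrSparkClassLoader());
 }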

Lessons
- Don't use Spark's Utils methods. Even though Utils is a private[spark] object, since I'm accessing it from a Java class, the Scala package access protection doesn't seem to apply. I never knew that until now.
- Always pass Utils.getContextOrSparkClassLoader() when doing Java deserialization in Spark.

4 comments:

Unknown said...

Hello and thanks for this post.
I understand using org.apache.spark.util.Utils failed.
But where is the "new" Utils coming from?
Could you please specify the full path of the new Utils?

Thanks,
Giora.

Ashwanth Kumar said...

There isn't any new Utils. If you want to use Utils.serialize / Utils.deserialize from your Java code, make sure you pass Utils.getContextOrSparkClassLoader along with it.
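Something along these lines (a rough sketch):

 import org.apache.spark.util.Utils;

 // Round-tripping a value through Spark's Utils from Java
 byte[] bytes = Utils.serialize(objA);                                    // write side
 objA = Utils.deserialize(bytes, Utils.getContextOrSparkClassLoader());   // read side, with the right classloader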

Unknown said...

Ah, OK. I'm working in Scala. Do you have any idea how this translates into Scala code?

Ashwanth Kumar said...

In that case you might need to create a Utils object of your own and re-use the code from the Spark implementation.

Ref - https://github.com/apache/spark/blob/v1.3.1/core/src/main/scala/org/apache/spark/util/Utils.scala#L148
