Context - Spark 1.3.0, Custom InputFormat and InputSplit.
Problem - At work I have a custom InputSplit that holds an object of another class, A, which I then need to pass on to my Key. A Spark job reads data using my custom InputFormat, and everything was fine on Spark 1.2.0. When we upgraded to 1.3.0, things started breaking with the following stack trace.
Caused by: java.lang.ClassNotFoundException: x.y.z.A
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:270)
    at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:625)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.util.Utils$.deserialize(Utils.scala:80)
    at org.apache.spark.util.Utils.deserialize(Utils.scala)
    at x.y.z.CustomRecordSplit.readFields(CustomRecordSplit.java:91)
Solution - It took me a while to realize that I'd been using Spark's Utils object to serialize and deserialize the object (x.y.z.A). The fix was very simple:
objA = Utils.deserialize(buffer, Utils.getContextOrSparkClassLoader());
It looks like in the earlier versions the __app__.jar was added to the Executor and Task classloader, but that no longer happens in the newer versions. Once I passed the context ClassLoader to the deserialization call, everything worked perfectly fine.
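To put the fix in context, here is roughly what the Writable part of the split looks like. CustomRecordSplit and A are the names from the stack trace above; the field name and the length-prefixed buffer handling are my own illustration, not the author's exact code.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.spark.util.Utils;

// Sketch only: the real class also implements the Hadoop InputSplit contract
// (getLength/getLocations); only the Writable part matters for this problem.
// A stands for the author's x.y.z.A, which needs to be Serializable.
public class CustomRecordSplit implements Writable {

    private A objA;

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the extra object into the split's on-the-wire form.
        byte[] buffer = Utils.serialize(objA);
        out.writeInt(buffer.length);
        out.write(buffer);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        byte[] buffer = new byte[in.readInt()];
        in.readFully(buffer);
        // The one-argument Utils.deserialize(buffer) resolves classes with the
        // default class loader, which no longer sees __app__.jar on 1.3.0
        // executors; passing the context class loader fixes the
        // ClassNotFoundException.
        objA = Utils.deserialize(buffer, Utils.getContextOrSparkClassLoader());
    }
}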
Lessons
- Don't rely on Spark's Utils object. Even though Utils is a private[spark] object, the Scala package access protection doesn't seem to apply when it's called from a Java class. I never knew that until now. A Spark-free alternative is sketched after this list.
- Always pass Utils.getContextOrSparkClassLoader() (or the thread's context class loader) when doing Java deserialization in Spark.
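If you would rather not depend on the private[spark] Utils at all (the first lesson above), the same class-loader-aware deserialization can be done in plain Java, which is essentially what Utils.deserialize(bytes, loader) does internally. This is a minimal sketch; the class name ContextClassLoaderDeserializer is mine, not Spark's.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;

public final class ContextClassLoaderDeserializer {

    private ContextClassLoaderDeserializer() {}

    public static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        ClassLoader contextLoader = Thread.currentThread().getContextClassLoader();
        // Prefer the context class loader (which has __app__.jar on the
        // executor) and fall back to this class's own loader.
        final ClassLoader loader = (contextLoader != null)
                ? contextLoader
                : ContextClassLoaderDeserializer.class.getClassLoader();
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes)) {
                @Override
                protected Class<?> resolveClass(ObjectStreamClass desc) throws ClassNotFoundException {
                    // Resolve classes against the chosen loader instead of the
                    // default application class loader.
                    return Class.forName(desc.getName(), false, loader);
                }
            }) {
            return ois.readObject();
        }
    }
}

Usage would be something like objA = (A) ContextClassLoaderDeserializer.deserialize(buffer); inside readFields.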
4 comments:
Hello and thanks for this post.
I understand using org.apache.spark.util.Utils failed.
But where is the "new" Utils coming from?
Could you please specify the full path of the new Utils?
Thanks,
Giora.
There isn't any new Utils. If you want to use Utils.serialize / Utils.deserialize from your Java code, make sure you pass Utils.getContextOrSparkClassLoader() along with it.
Ah, OK. I'm working on Scala. Do you have any idea how this translates into Scala code?
In that case you might need to create a Utils object of your own and re-use the code from the Spark implementation.
Ref - https://github.com/apache/spark/blob/v1.3.1/core/src/main/scala/org/apache/spark/util/Utils.scala#L148