Context - Spark 1.3.0, Custom InputFormat and InputSplit.
Problem - At my work, I had a custom InputSplit whose definition included an object of another class, A, which I then needed to pass on to my Key. A Spark job reads the data using my custom InputFormat, and everything was fine on Spark 1.2.0. When we upgraded to 1.3.0, things started breaking with the following stack trace.
Caused by: java.lang.ClassNotFoundException: x.y.z.A
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:270)
    at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:625)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.util.Utils$.deserialize(Utils.scala:80)
    at org.apache.spark.util.Utils.deserialize(Utils.scala)
    at x.y.z.CustomRecordSplit.readFields(CustomRecordSplit.java:91)
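The last frame is the interesting one: the split's readFields rebuilds the x.y.z.A object from bytes. Roughly, the failing code looked like the sketch below (buffer, objA and the length handling are my names and guesses, not the real code):

    // Sketch of the failing readFields inside CustomRecordSplit (names are hypothetical).
    @Override
    public void readFields(DataInput in) throws IOException {
        int length = in.readInt();
        byte[] buffer = new byte[length];
        in.readFully(buffer);
        // The single-argument Utils.deserialize resolves classes with the default
        // classloader, which no longer sees x.y.z.A inside __app__.jar on the executor.
        objA = (A) Utils.deserialize(buffer);
    }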
Solution - It took me a while to realize that I had been using Spark's Utils object to serialize and deserialize the object (x.y.z.A). The fix was very simple:
objA = Utils.deserialize(buffer, Utils.getContextOrSparkClassLoader());
It looks like in the earlier versions the __app__.jar was added to the Executor and Task classloaders, but that no longer happens in the latest versions. Once I passed the context ClassLoader to the deserialization call, it worked perfectly fine.
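For reference, my understanding of what Utils.getContextOrSparkClassLoader() gives you, written out in plain Java:

    // My understanding of Utils.getContextOrSparkClassLoader: prefer the thread's
    // context classloader (which on executors can see __app__.jar), and fall back
    // to the classloader that loaded Spark itself.
    ClassLoader loader = Thread.currentThread().getContextClassLoader();
    if (loader == null) {
        loader = Utils.class.getClassLoader();
    }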
Lessons
- Don't use Spark's Utils methods for this. Even though Utils is a private[spark] object, since I'm accessing it from a Java class the Scala package access protection doesn't apply. I never knew that until now. (See the plain-Java alternative sketched after this list.)
- Always use Utils.getContextOrSparkClassLoader() when doing Java deserialization in Spark.
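If you'd rather not depend on a private[spark] object at all, the same behaviour is only a few lines of plain Java. This is a sketch (class and method names are mine), not something Spark ships:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectStreamClass;

    // Plain-Java equivalent of Utils.deserialize(bytes, loader): an ObjectInputStream
    // that resolves classes through the classloader you hand it.
    public final class LoaderAwareDeserializer {
        public static Object deserialize(byte[] bytes, final ClassLoader loader)
                throws IOException, ClassNotFoundException {
            ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes)) {
                @Override
                protected Class<?> resolveClass(ObjectStreamClass desc)
                        throws IOException, ClassNotFoundException {
                    // Look the class up in the supplied loader instead of the default one.
                    return Class.forName(desc.getName(), false, loader);
                }
            };
            try {
                return ois.readObject();
            } finally {
                ois.close();
            }
        }
    }

In the split's readFields you would then call something like objA = (A) LoaderAwareDeserializer.deserialize(buffer, Thread.currentThread().getContextClassLoader());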