In my previous post, I demonstrated how Spark creates and serializes tasks. In this post, I show how to apply this knowledge to structure Spark applications in a maintainable and upgradable way while avoiding “task not serializable” exceptions.
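To make the problem concrete, here is a minimal sketch of how the exception typically appears: a closure passed to an RDD operation captures an object that is not serializable. The `Multiplier` helper class and the numbers are hypothetical, used only for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper class that does not extend Serializable.
class Multiplier(val factor: Int)

object NotSerializableDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("not-serializable-demo")
      .master("local[*]")
      .getOrCreate()

    val multiplier = new Multiplier(2)

    // The lambda captures `multiplier`, so Spark must serialize it to ship
    // the task to executors; because Multiplier is not serializable, this
    // fails with org.apache.spark.SparkException: Task not serializable.
    val doubled = spark.sparkContext
      .parallelize(1 to 10)
      .map(n => n * multiplier.factor)

    doubled.collect().foreach(println)
    spark.stop()
  }
}
```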
While working on a big data project, I had to write Spark applications that move and transform data between relational and distributed databases, such as Apache Hive. Such applications come with enough pitfalls of their own, so everyday problems like hard-to-read code or methods too large to fit on a single screen must be eliminated so that we can focus on the deeper issues. Moreover, most Spark jobs share the same shape: data is loaded from one or more databases, transformed, and then saved to one or more databases (see the sketch below). It therefore seems reasonable to apply GoF design patterns when programming Spark applications.
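As an illustration of that common shape, here is a minimal load-transform-save sketch. The table names and the filter expression are placeholders I chose for the example, not details from a real job.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// A minimal sketch of the load-transform-save shape shared by most such jobs.
object CopyJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("copy-job")
      .enableHiveSupport()
      .getOrCreate()

    // Load: read from a Hive table (hypothetical name).
    val source: DataFrame = spark.table("source_db.events")

    // Transform: business logic goes here (placeholder filter).
    val transformed: DataFrame = source.filter("event_type IS NOT NULL")

    // Save: write the result to another Hive table (hypothetical name).
    transformed.write.mode("overwrite").saveAsTable("target_db.events_clean")

    spark.stop()
  }
}
```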