Be Careful With Column Types in Spark Filters

One basic premise of Spark seems to be to “hide” exceptions for the most part and just keep things running as much as possible. This can lead to some unexpected results as there is no indicator that things have gone awry, but indeed they have! I came across this simple example recently and thought I would share.

Say you are reading a table into a dataframe and want to filter out some of the results, but you assumed the column types were correct and didn’t check them. You can run into a situation like this:

val my_df = spark.read.table("database.my_table")
  .select('id, 'name)

display(my_df)

id     name
12345  Sue
22334  John
       Kassandra
66787  Phillip

So in the result set, we see three normal id values and one empty (i.e. “”) id. We know our id column is supposed to be a String, but this table was accidentally created with it as an Integer (easy enough to do if you rely on inferSchema). We then see behavior like this when trying to count the rows with a non-empty id:

my_df.filter('id =!= "").count // Count non-empty ids

// res1: Long = 0

This is most likely because Spark implicitly casts the string literal to the column’s Integer type in order to compare them: “” cannot be parsed as an integer, so it becomes null, and a comparison against null is never true, so every row is filtered out. Either way, Spark just chugs along without any obvious warning and gives a result, albeit one that is most definitely not correct! So not only do you need to be careful with your schemas, you also can’t rely on Spark to warn you about or catch your mistakes!

Colorizing Bash Output

When I developed this, the great majority of my professional development was in Perl. We had also been focusing on building more robust unit test scripts for the code we wrote. As the tests started growing pretty large, I started looking for ways to make the test output a bit more readable. Since the tests are built using existing Perl packages like Test::More, I am somewhat limited in my customization choices. When looking for possibilities, most of the answers I found were code that you could add to your own script to do the colorization. Unless I wanted to build custom versions of the packages we use, that wasn’t the best option.

I also looked into multitail since I had used that for colorizing some log outputs, but it never seemed well suited to taking the output of a single-run script. Not only is scrolling around a bit awkward, but it doesn’t tend to leave the output available after the run. Multitail does OK if you first redirect the output to a file and then multitail that file, but that is harder to pull off if you don’t want to be switching between windows (plus it leaves a file that you’ll need to clean up at some point).

What I ended up going with was this script. The nice thing about it is that it can be pretty easily tailored to whatever script you want to colorize the output of. As currently written, it makes the output of Perl’s Test::More .t files a lot clearer and nicer to look at!

To use it, I run something like perl test.t 2>&1 | ./colorize.sh, which pipes both the stdout and stderr of the test run through the colorizer.

I’m sure that I could do some more customization, but I was pretty happy with how it turned out. It colorizes the different statements that we tend to have, and catches kill signals so your colors will reset to normal if you kill the run before it finishes. If I do make any updates, I’ll add them to the snippet.
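The script itself isn’t reproduced in this post, so what follows is only a minimal sketch of the idea, not the original. It assumes the TAP-style lines that Test::More prints (ok, not ok, and # diagnostics) and a terminal that understands ANSI color escapes. The function names colorize and colorize_line, the color choices, and the exit codes are all illustrative assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a colorize.sh filter; NOT the original script.
# Assumes TAP-style lines as printed by Perl's Test::More.

# ANSI escape sequences (assumes the terminal supports them).
ESC=$(printf '\033')
RED="${ESC}[31m"
GREEN="${ESC}[32m"
YELLOW="${ESC}[33m"
RESET="${ESC}[0m"

# Wrap one line in a color chosen by its prefix.
# "not ok" must be checked before "ok"; anything else passes through as-is.
colorize_line() {
  case "$1" in
    "not ok"*) printf '%s%s%s\n' "$RED" "$1" "$RESET" ;;
    "ok"*)     printf '%s%s%s\n' "$GREEN" "$1" "$RESET" ;;
    "#"*)      printf '%s%s%s\n' "$YELLOW" "$1" "$RESET" ;;
    *)         printf '%s\n' "$1" ;;
  esac
}

# Read stdin line by line; the "|| [ -n ... ]" part keeps a final line
# that has no trailing newline.
colorize() {
  local line
  while IFS= read -r line || [ -n "$line" ]; do
    colorize_line "$line"
  done
}

# Emit the reset sequence so the terminal goes back to its normal color.
reset_colors() { printf '%s' "$RESET"; }

# Act as a filter only when executed directly with piped input,
# e.g. perl test.t 2>&1 | ./colorize.sh
if [ "${BASH_SOURCE:-}" = "$0" ] && [ ! -t 0 ]; then
  # Reset colors if the run is killed before it finishes.
  trap 'reset_colors; exit 130' INT
  trap 'reset_colors; exit 143' TERM
  colorize
fi
```

Tailoring it for a different kind of output should mostly be a matter of changing the patterns in the case statement. Because each colored line ends with its own reset, the traps only matter when the run is killed partway through a line.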