One basic premise of Spark seems to be that it "hides" exceptions for the most part and keeps things running as much as possible. This can lead to some unexpected results, because there is no indicator that things have gone awry, but indeed they have! I came across this simple example recently and thought I would share.
Say you are reading a table into a DataFrame and want to filter out some of the results, but you assumed the column types were correct and didn't check them. You can encounter a situation like this:
val my_df = spark.read.table("database.my_table")
  .select('id, 'name)

display(my_df)
id    | name
12345 | Sue
22334 | John
      | Kassandra
66787 | Phillip
So in the result set, we see three normal id values and one empty (i.e. "") id. We know our id column is supposed to be a String, but this table was accidentally generated with it as an Integer (easy enough to happen if you rely on inferSchema). Then we see behavior like this when trying to count the non-empty id rows:
my_df.filter('id =!= "").count // Count non-empty ids
// res1: Long = 0
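Had we checked the schema up front, the mismatch would have jumped out. A quick printSchema call shows something along these lines (sketched output, assuming the integer id described above):

my_df.printSchema()
// root
//  |-- id: integer (nullable = true)
//  |-- name: string (nullable = true)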
This is likely because the comparison itself is invalid: =!= "" isn't a valid Integer check, so "" gets cast to a numeric type, the predicate evaluates to null for every row, and every row is filtered out. Spark just chugs along without any obvious warnings and gives a result, albeit one that is most definitely not correct! So not only do you need to be careful with your schemas, but don't rely on Spark to warn you about or catch your mistakes either!
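To see the whole thing end to end, here is a minimal, self-contained sketch of the same behavior. The data, session setup, and app name are made up for illustration (not the original table), and it assumes Spark's default, non-ANSI settings; a couple of safer ways to write the filter follow at the end.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("silent-type-mismatch").getOrCreate()
import spark.implicits._

// Stand-in data mirroring the table above: the id column ends up as an
// integer, and Kassandra has no id at all
val df = Seq(
  (Some(12345), "Sue"),
  (Some(22334), "John"),
  (None, "Kassandra"),
  (Some(66787), "Phillip")
).toDF("id", "name")

// "" cannot be cast to the id column's numeric type, so the predicate is
// null for every row and every row is silently dropped
df.filter(col("id") =!= "").count() // 0, with no error or warning

// Safer: check for missing ids directly, or cast explicitly before comparing
df.filter(col("id").isNotNull).count()             // 3
df.filter(col("id").cast("string") =!= "").count() // 3

The explicit cast makes the intent obvious and, more importantly, keeps the comparison valid for the type the column actually has.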