pyspark: the number of entries in a column which are null or not

There were a handful of resources describing how to count the number of null values in a pyspark dataframe. For example, this Stack Overflow thread asks for the number of null (and nan) values for each column in a pyspark dataframe. Given the total number of entries in the dataframe, one could then indirectly compute the number of not-null values (by taking the difference). However, I sought a solution which would simultaneously output both the number of null and not-null values (for a single column). Eventually, I wrote the following possible solution:

    (
        df
        .select(df.MY_COLUMN_NAME.isNull().alias("MY_COLUMN_NAME"))
        .groupby("MY_COLUMN_NAME")
        .count()
        .show()
    )

In contrast, the following is the solution with just the count of null values (with count and when imported from pyspark.sql.functions):

    (
        df
        .select(count(when(df.MY_COLUMN_NAME.isNull(), True)).alias("MY_COLUMN_NAME"))
        .show()
    )

Of course, the advantage of the latter is its ability to summarize across every column:

    (
        df
        .select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])
        .show()
    )
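For a concrete illustration of the first approach, here is a minimal, self-contained sketch; the column name "value" and the sample data are made up for the example, and the exact output table is only what I would expect Spark to print:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, count, when

    spark = SparkSession.builder.getOrCreate()

    # A toy dataframe with a made-up integer column "value" containing some nulls.
    df = spark.createDataFrame(
        [(1,), (None,), (3,), (None,)],
        schema="value: int",
    )

    # Count the null and not-null entries of "value" at the same time:
    (
        df
        .select(df.value.isNull().alias("value_is_null"))
        .groupby("value_is_null")
        .count()
        .show()
    )

    # Expected output, roughly (row order may vary):
    # +-------------+-----+
    # |value_is_null|count|
    # +-------------+-----+
    # |         true|    2|
    # |        false|    2|
    # +-------------+-----+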