pyspark: the number of entries in a column which are null or not
There were a handful of resources describing how to count the number of null values in a pyspark dataframe. For example, this Stack Overflow thread asks for the number of null (and NaN) values for each column of a pyspark dataframe. Thus, given the total number of entries in the dataframe, one could indirectly compute the number of not-null values (by taking the difference). However, I wanted a solution which would simultaneously output both the number of null and not-null values (for a single column).
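As a minimal sketch of that indirect approach (assuming a dataframe df with a hypothetical column named MY_COLUMN_NAME), something like this should work:

# Indirect approach: total rows minus null rows gives the not-null count.
total = df.count()
nulls = df.filter(df.MY_COLUMN_NAME.isNull()).count()
not_nulls = total - nulls
print(total, nulls, not_nulls)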
Eventually, I wrote the following possible solution:
(
    df
    # Replace the column with a boolean null indicator, then count rows per value.
    .select(df.MY_COLUMN_NAME.isNull().alias("MY_COLUMN_NAME"))
    .groupby("MY_COLUMN_NAME")
    .count()
    .show()
)
In contrast, the following is the solution with just the count of null values:
from pyspark.sql.functions import count, when

(
    df
    # when(...) without otherwise() yields null for non-matching rows,
    # and count() skips nulls, so only the null entries are counted.
    .select(count(when(df.MY_COLUMN_NAME.isNull(), True)).alias("MY_COLUMN_NAME"))
    .show()
)
Of course, the advantage of the latter is its ability to summarize across every column:
from pyspark.sql.functions import col, count, when

(
    df
    # One null-count expression per column in the dataframe.
    .select([count(when(col(c).isNull(), True)).alias(c) for c in df.columns])
    .show()
)
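For anyone who wants to try these snippets end to end, a throwaway dataframe along the following lines should do; the column name and values are placeholders for illustration, not data from the original problem:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy dataframe with a few nulls, purely for illustration.
df = spark.createDataFrame(
    [(1,), (None,), (3,), (None,)],
    ["MY_COLUMN_NAME"],
)

Running the first snippet against this dataframe should then show one row counting true (null) entries and one counting false (not null) entries.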
[20200813][20200923 Edit]