pyspark: the number of entries in a column which are null or not

There are a handful of resources describing how to count the number of null values in a pyspark dataframe. For example, this Stack Overflow thread asks for the number of null (and NaN) values for each column of a pyspark dataframe. Given the total number of entries in the dataframe, one could then indirectly compute the number of not-null values (by taking the difference). However, I was after a solution which would simultaneously output both the number of null and not-null values (for a single column).

Eventually, I wrote the following possible solution:
(
  df
  # map each entry to a boolean flag: True if null, False otherwise
  .select(df.MY_COLUMN_NAME.isNull().alias("MY_COLUMN_NAME"))
  # group on the flag and count, yielding one row for null and one for not null
  .groupby("MY_COLUMN_NAME")
  .count()
  .show()
)
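
To see the shape of the output, here is a self-contained sketch on a hypothetical toy dataframe; the SparkSession setup, the sample data, and the column name age are assumptions for illustration, not part of the original post.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical toy data: two null entries in the "age" column
toy_df = spark.createDataFrame(
  [("alice", 34), ("bob", None), ("carol", 29), ("dan", None)],
  ["name", "age"],
)

(
  toy_df
  .select(toy_df.age.isNull().alias("age"))
  .groupby("age")
  .count()
  .show()
)
# expected output (row order may vary):
# +-----+-----+
# |  age|count|
# +-----+-----+
# | true|    2|
# |false|    2|
# +-----+-----+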

In contrast, the following is the solution with just the count of null values:
from pyspark.sql.functions import count, when

(
  df
  # when(...) without an otherwise() yields null unless the entry is null,
  # and count() skips nulls, so only the null entries are counted
  .select(count(when(df.MY_COLUMN_NAME.isNull(), True)).alias("MY_COLUMN_NAME"))
  .show()
)

Of course, the advantage of the latter was its ability to summarize across every column:
from pyspark.sql.functions import col, count, when

(
  df
  # one null count per column, labeled with the column's name
  .select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])
  .show()
)
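
As noted at the start, the not-null counts can be recovered indirectly by subtracting the per-column null counts from the total number of rows; a minimal sketch of that (reusing the all-columns expression above on the same df) might look like:

from pyspark.sql.functions import col, count, when

total = df.count()
null_counts = (
  df
  .select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])
  .collect()[0]  # single Row of per-column null counts
  .asDict()
)
# not-null counts follow by taking the difference with the row total
not_null_counts = {c: total - n for c, n in null_counts.items()}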

[20200813][20200923 Edit]
