Posts

pyspark: the number of entries in a column which are null or not

There were a handful of resources describing how to count the number of null values in a pyspark dataframe. For example, this Stack Overflow thread asks for the number of null (and nan) values for each column in a pyspark dataframe. Given the total number of entries in the dataframe, one could then indirectly compute the number of non-null values (by taking the difference). However, I was after a solution which would simultaneously output both the number of null and non-null values (for a single column). Eventually, I wrote the following possible solution:

    (
        df
        .select(df.MY_COLUMN_NAME.isNull().alias(MY_COLUMN_NAME))
        .groupby(MY_COLUMN_NAME)
        .count()
        .show()
    )

In contrast, the following is the solution with just the count of null values:

    from pyspark.sql.functions import count, when

    (
        df
        # when() needs a value to emit for matching rows; count() then skips the non-matching (null) results
        .select(count(when(df.MY_COLUMN_NAME.isNull(), 1)).alias(MY_COLUMN_NAME))
        .show()
    )

Of course, the advantage of the latter was its ability to summarize across every column: ( df .select([count(when(col(c)...

Observable HQ: dropdown input, d3 transition, and viewof

Over the past few days, I wanted to pin down my understanding of the relationship between the selection made in a dropdown form input and other cells which depend on that selection. This exploration started with drawing a rectangle and using code from example notebooks with dropdown selections to cause a transition in the chart to occur (see Arc Diagram). One added complexity in Mike Bostock's Arc Diagram notebook is a timeout that automatically triggers a change in the dropdown selection. Originally this code block had a bug which caused me some confusion, but it has been a valuable learning experience: Mike Bostock commented on the bug and fixed it, so I was able to update my notebook accordingly. In any case, the notebook contains three charts, and each chart behaves differently based on whether I refer to the dropdown form object by `myNumber` or `viewof myNumber.value`, as well as where I make such references. In the first chart, I referenc...

Getting to know... D3

Today I was looking into ObservableHQ and their Quickstart page had many suggested resources. The second of these resources was titled "Learn D3." Since it's been several years since I've worked with D3, I felt it would be appropriate to go through it. In addition, at the time I thought the resource would lean heavily towards learning D3 within the context of an Observable notebook. Instead, the series was predominantly a comprehensive introduction to D3, delivered via a series of Observable notebooks, and only occasionally touched on some of the benefits of using an Observable notebook. In hindsight, I should have expected the relationship between an Observable notebook and D3 to be more like that between a Jupyter notebook and a Python module: learning to use the features of Jupyter and learning to use a particular Python module are separate activities. In any case, below you'll find my notes on each of the sections in the Observable series "Learn D3" by ...

Parquet + PySpark (= Speed)

It was mentioned in passing in my previous post that I needed to set up `winutils.exe` in order to save my dataframe to a parquet file. It was not, however, mentioned why I wanted to save the dataframe as a parquet file. In short, the answer is speed. First, let's load a dataset using pyspark, save it as a parquet file, load it back, and compare the time it takes to count the number of items in the file. Select your favorite medium-sized data file. I happened to be working with a CSV file that was about 7 GB, "data.csv" (composed of about 6.5 million rows and 300 columns). My data had a header, so I ran `df = spark.read.option("header", "true").csv("data.csv")`. The default headings had characters not allowed by parquet when saving, so I looped through the columns and used regex to find and replace the set of illegal characters (" ,;{}()\n\t="):

    import re
    df2 = df
    for column in df.columns:
        new_column = re.sub(...
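The column-renaming step described above can be sketched roughly as follows. This is not the code from the post: the helper names `sanitize_column_name` and `build_rename_map` are hypothetical, and using an underscore as the replacement is an assumption. Only the set of disallowed characters is taken from the text.

```python
import re

# Characters the post lists as not allowed in parquet column names: " ,;{}()\n\t="
ILLEGAL_CHARS = re.compile(r"[ ,;{}()\n\t=]")


def sanitize_column_name(name, replacement="_"):
    """Return `name` with every disallowed character swapped for `replacement`.

    The underscore default is an assumption made for illustration.
    """
    return ILLEGAL_CHARS.sub(replacement, name)


def build_rename_map(columns, replacement="_"):
    """Map each original column name to its sanitized form, keeping only names that change."""
    renamed = {}
    for column in columns:
        new_column = sanitize_column_name(column, replacement)
        if new_column != column:
            renamed[column] = new_column
    return renamed


# With a PySpark dataframe `df`, the map could be applied like this:
#     df2 = df
#     for old, new in build_rename_map(df.columns).items():
#         df2 = df2.withColumnRenamed(old, new)
```

Keeping the renaming logic in plain Python makes it easy to check before touching the dataframe. Note that two different headings can collapse to the same sanitized name (for example "a b" and "a,b"), so it may be worth checking the result for duplicates.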

PySpark + Anaconda + Jupyter (Windows)

It seems like just about every six months I need to install PySpark and the experience is never the same. Note that this isn't necessarily the fault of Spark itself. Instead, it's a combination of the many different situations under which Spark can be installed, a lack of official documentation for each and every such situation, and my not writing down the steps I took to successfully install it. So today, I decided to write down the steps needed to install the most recent version of PySpark under the conditions in which I currently need it: inside an Anaconda environment on Windows 10. Note that the page which best helped produce the following solution can be found here (Medium article). I later found a second page with similar instructions which can be found here (Towards Data Science article).

Steps to Installing PySpark for use with Jupyter

This solution assumes Anaconda is already installed, an environment named `test` has already been created, and Jupyter has already...

Using tensorflowjs_converter

A friend and I were looking to convert a TensorFlow model to a TensorFlow.js model. My friend had done some prior research and started me off with the following page: Importing a Keras model into TensorFlow.js. The first step, and one which I messed up, is installing the `tensorflowjs` library into a Python environment: `pip install tensorflowjs`. Although I'm not sure of the reason for the mistake, the fact of the matter is that I ended up installing `tensorflow` (`pip install tensorflow`) and spent a good amount of time wondering why the bash command `tensorflowjs_converter` wasn't working. In any case, in the tutorial linked above, the command takes the following structure:

    # bash
    tensorflowjs_converter --input_format keras \
        path/to/my_model.h5 \
        path/to/tfjs_target_dir

Meanwhile, the readme for the tfjs COCO-SSD model supplies a converter command for removing the post-processing graph as a way of improving performance ...

Getting to know... ObservableHQ

A couple of weeks ago, I looked at a demo of ObservableHQ and the first interesting difference between an Observable notebook and a Jupyter Notebook is that the output of an Observable notebook is above the code. In any case, today I signed up for an account and was presented with a Quickstart page with 12 different recommended reads. I went through two of them, A Taste of Observable and Learn D3.

Quickstart

A Taste of Observable (7 minute read)

So here are some of the items I noticed after going through this notebook:

- Can import and reuse an object from another notebook (shown near the end of the notebook). In the example given, the imported object refers to a local variable called `forecast`. However, since that object is being pulled away from its source, we must assign ("inject") a new object to the variable name.
- Variables can be changed and changes will automatically push forward and update any cells which depend on those variables.
- I mentioned earlier tha...