Posts

Showing posts from June, 2020

Getting to know... D3

Today I was looking into ObservableHQ, and their Quickstart page had many suggested resources. The second of these resources was titled "Learn D3." Since it's been several years since I've worked with D3, I felt it would be appropriate to go through it. Going in, I also expected the resource to lean heavily toward learning D3 within the context of an Observable notebook. Instead, the series was predominantly a comprehensive introduction to D3, delivered via a series of Observable notebooks, that only occasionally touched on the benefits of using an Observable notebook. In hindsight, I should have expected the relationship between an Observable notebook and D3 to be more like that between a Jupyter notebook and a Python module: learning the features of Jupyter and learning a particular Python module are largely independent activities. In any case, below you'll find my notes on each of the sections in the Observable series "Learn D3" by ...

Parquet + PySpark (= Speed)

It was mentioned in passing in my previous post that I needed to set up `winutils.exe` in order to save my dataframe to a parquet file. It was not, however, mentioned why I wanted to save the dataframe as a parquet file. In short, the answer is speed. First, let's load a dataset using pyspark, save it as a parquet file, load it back, and compare the time it takes to count the number of rows in each. Select your favorite medium-sized data file. I happened to be working with a CSV file of about 7 GB, "data.csv" (roughly 6.5 million rows and 300 columns). My data had a header, so I ran `df = spark.read.option("header", "true").csv("data.csv")`. The default headings contained characters that parquet does not allow when saving, so I looped through the columns and used a regex to find and replace the set of illegal characters (" ,;{}()\n\t="): import re df2 = df for column in df.columns: new_column = re.sub(...
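The excerpt cuts off mid-loop, so here is a minimal sketch of the full round trip it describes: renaming the columns so parquet will accept them, writing and re-reading the parquet file, and timing a count against each source. The file names, the `_` replacement character, and the timing harness are illustrative assumptions rather than the post's exact code.

```python
import re
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-speed").getOrCreate()

# Load the CSV, treating the first row as the header.
df = spark.read.option("header", "true").csv("data.csv")

# Parquet rejects these characters in column names: " ,;{}()\n\t="
df2 = df
for column in df.columns:
    new_column = re.sub(r'[ ,;{}()\n\t=]', '_', column)
    df2 = df2.withColumnRenamed(column, new_column)

# Save as parquet, then load it back.
df2.write.mode("overwrite").parquet("data.parquet")
df_parquet = spark.read.parquet("data.parquet")

# Compare how long a count takes against each source.
start = time.time()
df.count()
print("CSV count took", time.time() - start, "seconds")

start = time.time()
df_parquet.count()
print("Parquet count took", time.time() - start, "seconds")
```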

PySpark + Anaconda + Jupyter (Windows)

It seems like just about every six months I need to install PySpark, and the experience is never the same. Note that this isn't necessarily the fault of Spark itself. Instead, it's a combination of the many different situations under which Spark can be installed, a lack of official documentation for each and every such situation, and me not writing down the steps I took to successfully install it. So today, I decided to write down the steps needed to install the most recent version of PySpark under the conditions in which I currently need it: inside an Anaconda environment on Windows 10. Note that the page which best helped produce the following solution can be found here (Medium article). I later found a second page with similar instructions, which can be found here (Towards Data Science article). Steps to Installing PySpark for use with Jupyter: This solution assumes Anaconda is already installed, an environment named `test` has already been created, and Jupyter has already...
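The excerpt is cut off before the steps themselves, but a quick way to confirm that whatever steps you follow actually worked is to start a local SparkSession from a Jupyter cell. This is a minimal sanity check, assuming `pyspark` was installed into the `test` environment (e.g. via `pip install pyspark`) and that Java and `winutils.exe` are already in place:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session from inside Jupyter.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("install-check")
         .getOrCreate())

# If this prints a version string, PySpark is importable and Spark started.
print(spark.version)

# A tiny DataFrame proves the session can actually do work.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()
```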

Using tensorflowjs_converter

A friend and I were looking to convert a TensorFlow model to a TensorFlow.js model. My friend had done some prior research and started me off with the following page: Importing a Keras model into TensorFlow.js . The first step, and one that I messed up on, is installing the `tensorflowjs` library into a Python environment: `pip install tensorflowjs`. Although I'm not sure of the reason for the mistake, the fact of the matter is that I ended up installing `tensorflow` (`pip install tensorflow`) and spent a good amount of time wondering why the bash command `tensorflowjs_converter` wasn't working. In any case, in the tutorial linked above, the command takes the following structure: `tensorflowjs_converter --input_format keras path/to/my_model.h5 path/to/tfjs_target_dir`. Meanwhile, the readme for the tfjs COCO-SSD model supplies a converter command for removing the post-processing graph as a way of improving performance ...
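For completeness, the `tensorflowjs` package also exposes a Python API that performs the same Keras conversion without shelling out to the CLI. A minimal sketch, with a throwaway Keras model standing in for a real one and the target directory kept as a placeholder:

```python
import tensorflow as tf
import tensorflowjs as tfjs

# Any Keras model works here; this tiny one is just a stand-in.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="relu", input_shape=(3,)),
    tf.keras.layers.Dense(1),
])

# Writes model.json plus weight shard files to the target directory,
# the same output the tensorflowjs_converter CLI produces.
tfjs.converters.save_keras_model(model, "path/to/tfjs_target_dir")
```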

Getting to know... ObservableHQ

A couple of weeks ago, I looked at a demo of ObservableHQ, and the first interesting difference between an Observable notebook and a Jupyter notebook is that the output of an Observable notebook appears above the code. In any case, today I signed up for an account and was presented with a Quickstart page with 12 different recommended reads. I went through two of them, A Taste of Observable and Learn D3. Quickstart: A Taste of Observable (7 minute read). So here are some of the items I noticed after going through this notebook: - You can import and reuse objects from another notebook (shown near the end of the notebook). In the example given, the imported object refers to a local variable called `forecast`. However, since that object is being pulled away from its source, we must assign ("inject") a new object to the variable name. - Variables can be changed, and changes will automatically propagate and update any cells that depend on those variables. - I mentioned earlier tha...