Parquet + PySpark (= Speed)

I mentioned in passing in my previous post that I needed to set up `winutils.exe` in order to save my dataframe to a parquet file. I did not, however, mention why I wanted to save the dataframe as a parquet file. In short, the answer is speed.

First, let's load a dataset using pyspark, save it as a parquet file, load it back, and compare the time it takes to count the number of items in the file.

Select your favorite medium-sized data file. I happened to be working with a CSV file that was about 7 GB, "data.csv" (composed of about 6.5 million rows and 300 columns).

My data had a header, so I ran
`df = spark.read.option("header", "true").csv("data.csv")`

The default headings had characters that parquet does not allow in column names, so I looped through the columns and used a regex to remove the illegal characters (" ,;{}()\n\t="):
import re

df2 = df
for column in df.columns:
    # Strip every character parquet rejects in a column name
    new_column = re.sub("[ ,;{}()\n\t=]", "", column)
    df2 = df2.withColumnRenamed(column, new_column)
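
The renaming step above can also be written as a small helper that is easy to check without a Spark session. This is only a sketch: `sanitize_columns` is a hypothetical name (not a PySpark function), and the duplicate check is an extra safeguard I am assuming is worth having, since stripping characters can make two headings collide.

```python
import re

# Characters the post lists as not allowed in parquet column names.
ILLEGAL_CHARS = re.compile(r"[ ,;{}()\n\t=]")


def sanitize_columns(columns):
    """Return the column names with the illegal characters removed.

    Hypothetical helper, not part of PySpark. Raises ValueError if two
    cleaned names end up identical.
    """
    cleaned = [ILLEGAL_CHARS.sub("", name) for name in columns]
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("column names collide after cleaning")
    return cleaned


print(sanitize_columns(["user id", "price (USD)", "a=b"]))
# ['userid', 'priceUSD', 'ab']
```

With a dataframe in hand, `df.toDF(*sanitize_columns(df.columns))` should apply all of the new names in one call instead of looping over `withColumnRenamed`.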

To save the dataframe as a parquet file, we run
`df2.write.parquet("data.parquet")`

Once it's been saved, we can load it back into the notebook with the following:
`parquetFile = spark.read.parquet("data.parquet")`

Finally, for an example comparison, we compare `%timeit df.count()` and `%timeit parquetFile.count()`. Over 7 runs, 1 loop each, they took 20.6 s +/- 1.11 s per loop and 1.01 s +/- 123 ms per loop, respectively. That's roughly a 20x difference!*
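
`%timeit` is an IPython magic, so it only works inside a notebook or an IPython shell. In a plain Python script, a rough stand-in can be built on `time.perf_counter`. The sketch below is one way this might be done; `time_action` is a hypothetical helper, and it reports a simple mean and standard deviation rather than reproducing `%timeit` exactly.

```python
import statistics
import time


def time_action(action, runs=7):
    """Call `action` `runs` times (at least 2); return (mean, stdev) in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


# Stand-in workload so the sketch runs without Spark; with the dataframes
# from this post you would pass df.count or parquetFile.count instead.
mean_s, std_s = time_action(lambda: sum(range(100_000)))
print(f"{mean_s:.6f} s +/- {std_s:.6f} s per loop")
```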

Note that although we paid an upfront cost of 7 min 15 s to save the dataframe as a parquet file, we come out ahead after sufficiently many operations. For `count()`, each call saves roughly 20 s, so it would take about 20 calls to recover the initial cost.

*I later ran a second set of timings, which came to 26.2 s +/- 389 ms per loop and 933 ms +/- 86.4 ms per loop, respectively.
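
The break-even estimate can be worked out directly from the timings reported in this post. The sketch below only does that arithmetic; `break_even_calls` is a hypothetical helper name, and it assumes every `count()` saves the same amount of time.

```python
import math


def break_even_calls(save_cost_s, slow_s, fast_s):
    """Number of calls needed before the one-off save cost is recovered."""
    saved_per_call = slow_s - fast_s
    if saved_per_call <= 0:
        raise ValueError("the fast path must be faster than the slow path")
    return math.ceil(save_cost_s / saved_per_call)


save_cost = 7 * 60 + 15  # 7 min 15 s to write the parquet file, in seconds

# First set of timings: 20.6 s (CSV) vs 1.01 s (parquet)
print(break_even_calls(save_cost, 20.6, 1.01))   # 23
# Second set of timings: 26.2 s (CSV) vs 0.933 s (parquet)
print(break_even_calls(save_cost, 26.2, 0.933))  # 18
```

So "about 20" is the right ballpark: somewhere between 18 and 23 calls, depending on which set of timings is used.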

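Load times were similar: loading from the CSV took 163 ms +/- 29.7 ms per loop, while loading from the parquet file took 153 ms +/- 52.8 ms per loop. That is not too surprising, since Spark evaluates lazily and the read itself does little work until an action such as `count()` runs.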

While the original CSV is about 7 GB, the parquet file is just 770 MB.
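
When checking sizes like these, keep in mind that Spark writes a parquet "file" as a directory of part files, so the size on disk is the sum of the files inside it. Here is a minimal sketch for measuring either a single file or a directory, using only the standard library (`size_on_disk` is a hypothetical helper name):

```python
from pathlib import Path


def size_on_disk(path):
    """Total size in bytes of a file, or of all files under a directory."""
    p = Path(path)
    if p.is_file():
        return p.stat().st_size
    return sum(f.stat().st_size for f in p.rglob("*") if f.is_file())


# Example with the paths from this post (they must exist on your machine):
# ratio = size_on_disk("data.csv") / size_on_disk("data.parquet")
```

With the sizes above, the ratio comes out to roughly 9 (7 GB / 770 MB).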

On June 25, 2018, I added a draft post to my notes blog titled "Parquet is Efficient." The draft never got filled in. However, I remember working with Spark and performing some joins that took a long time to run. Luckily, my coworker somehow figured out that saving to and loading from a parquet file made certain operations incredibly fast. In particular, if I recall correctly, a join that originally took several hours came down to just several minutes once we used parquet (assuming the same factor-of-20 difference, 8 hours would reduce to 24 minutes).

