Gathering data from CSVs into typed parquet files for faster access

When running simulations or other experiments, I often end up with a lot of individual CSV files, e.g., one for each parameter variation or replication in a study. This is nice, as CSVs are easy to handle, output, and archive. But processing them for evaluation and plotting becomes cumbersome. Pandas is still rather memory-hungry when it comes to reading data, especially if strings are involved.

This example is taken from my reception-study, a little side-project I started to dig deeper into the intricacies of wireless transmissions.

The old way

Typically, I would:

  1. start a jupyter notebook
  2. write a read_one_csv() function that can read one CSV file into a DataFrame
  3. extend that function step by step so the DataFrame has the right data types and columns
  4. glob for all result file names
  5. build a list or generator that applies read_one_csv to all files
  6. pass that collection to pandas.concat to create one large DataFrame

E.g. (for long-format result files of OMNeT++):

import glob

import pandas as pd


def read_one_csv(file_name):
    # one OMNeT++ scalar result file: space-separated "module statistic value" lines
    return pd.read_csv(
        file_name,
        sep=" ",
        names=["module", "statistic", "value"],
        dtype={"module": "category", "statistic": "category", "value": float},
    )

file_names = glob.glob("results/*.sca.csv")
df = pd.concat(
    read_one_csv(file_name) for file_name in file_names
).astype({"module": "category", "statistic": "category", "value": float})

This works okay-ish for small data sets, but can take quite a while once the amount of data grows. Furthermore, as soon as there are strings in the CSVs, e.g., as category flags or identifiers, memory usage deteriorates even more. Normally, I would convert these strings to categoricals, typically directly in read_one_csv via DataFrame.astype({'colname': 'category'}). But then pandas.concat would most likely break them again and fall back to objects (str): as soon as the categoricals are not exactly the same for all DataFrames to be concatenated, the column reverts to str. This can easily happen if some categories do not appear in some of the CSV files.
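
To illustrate the concat problem with a minimal, self-contained example (the statistic names here are made up): as soon as the category sets of the two frames differ, the concatenated column silently falls back to object.

import pandas as pd

a = pd.DataFrame({"statistic": pd.Categorical(["rxPower", "txPower"])})
b = pd.DataFrame({"statistic": pd.Categorical(["rxPower", "collisions"])})

# the category sets differ, so the result is object again, not category
print(pd.concat([a, b]).dtypes)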

In order to circumvent this, I could build a complete pandas.CategoricalDtype in advance. It would contain the set of all categories from all CSV files (for a certain column). But that would mean knowing all the categories in advance. Sometimes they are known, e.g., for small sets of parameters. But in the general case, they are only known once all CSV files have been read.
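
A sketch of that two-pass approach, reusing the column names and glob pattern from the example above:

import glob

import pandas as pd

file_names = glob.glob("results/*.sca.csv")

# first pass: collect the union of all statistic names
all_statistics = set()
for file_name in file_names:
    statistics = pd.read_csv(
        file_name, sep=" ", names=["module", "statistic", "value"], usecols=["statistic"]
    )["statistic"]
    all_statistics.update(statistics.unique())
statistic_dtype = pd.CategoricalDtype(sorted(all_statistics))

# second pass: read with the shared dtype so pandas.concat keeps the categorical
df = pd.concat(
    pd.read_csv(
        file_name,
        sep=" ",
        names=["module", "statistic", "value"],
        dtype={"statistic": statistic_dtype, "value": float},
    )
    for file_name in file_names
)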

Again, this is manageable if the data set is small enough and fits into memory multiple times, because I would have to hold both the list of individual DataFrames and the concatenated DataFrame, at least for some time. Or I would have to parse the data twice – on every load. And as the data set becomes large, this may take a while. CSVs are nice and human readable, but not the fastest to parse.

And finally, all of this would happen every time I had to load the data into the notebook. Crashed the kernel? Load again! Rebooted the machine? Load again! Missed some file while loading? Load again!

I hope you get my point.

Typed storage

So CSVs are nice and all, but being able to save and load categoricals directly would be beneficial in such a case. Not having to look for dozens or hundreds of small CSV files and parse them individually also sounds nice. So I took a look at Pandas' list of supported file formats (again).

SQL databases are cool, but I want to stick to my CSVs as persistent storage and just have some nice temporary solution as a sort of cache. Also, setting up a PostgreSQL server means having to make sure another service is running, even if it’s just in a container. And to get decent performance, I’d have to do a bit of configuration and tuning as well. Finally, I do my automation with Snakemake. Telling that – or most other make flavors – about a file derived from some CSVs is easy, but having some SQL table in the dependency graph? I honestly don’t know. So I’d like to stick to simple files that I can just move and delete, thank you very much. (If you want to use SQL and are also working with OMNeT++, check out MightyCelu’s oppsql package to stay away from CSVs in the first place.)

So at the time of writing, there are three options: HDF5, parquet, and feather. Feather is really fast, but a bit too temporary, and AFAIK files are not appendable (see below). HDF5 is okay in general, but appending is only possible with the table format, which in turn does not perform that well. Though check this out if querying data on-disk is something you are looking for. Parquet, however, ticks all my boxes.
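
For reference, appending to an HDF5 store from pandas requires that table format and looks roughly like this (needs PyTables installed; the file and key names here are made up):

import pandas as pd

df = pd.DataFrame({"module": ["node[0].nic"], "statistic": ["rxPower"], "value": [1.0]})

# only format="table" supports append=True; the default fixed format does not
df.to_hdf("results/all.h5", key="results", format="table", append=True)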

Collecting in advance

What I came up with instead was:

  1. run all my simulations and convert results to CSV
  2. parse the resulting files for all categoricals (one set per categorical column, i.e., signals and modules in this case) – see the sketch right after this list
  3. collect all results – with categoricals – into one parquet file
  4. do evaluations in jupyter as I like
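
Step 2 boils down to collecting the union of all module and signal names once, before any conversion. A rough sketch of how that could look (the glob pattern and output file names are placeholders; the column layout matches the result files used in step 3):

import glob

import pandas as pd

columns = ["vecid", "module", "signal", "event", "time", "value"]
modules, signals = set(), set()

# collect the union of all module and signal names across the result files
for file_name in glob.glob("results/*.vec.csv"):
    df = pd.read_csv(file_name, sep=" ", names=columns, usecols=["module", "signal"])
    modules.update(df["module"].unique())
    signals.update(df["signal"].unique())

# one name per line, read back later as modules_file and signals_file
pd.Series(sorted(modules)).to_csv("results/modules.txt", index=False, header=False)
pd.Series(sorted(signals)).to_csv("results/signals.txt", index=False, header=False)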

Step 3 of this looks something like this:

# `input` is passed from Snakemake here, but could also be a command line argument

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# set up categoricals and data types
modules_dtype = pd.CategoricalDtype(
    pd.read_csv(input.modules_file, header=None, names=["modules"]).modules,
    ordered=True,
)
signals_dtype = pd.CategoricalDtype(
    pd.read_csv(input.signals_file, header=None, names=["signals"]).signals,
    ordered=True,
)
columns = ["vecid", "module", "signal", "event", "time", "value"]
dtypes = {
    "module": modules_dtype,
    "signal": signals_dtype,
    "vecid": int,
    "event": int,
    "time": float,
    "value": float,
}

# iteratively parse and write to parquet
pqwriter = None
for runnr, file_name in enumerate(input.result_files):
    df = pd.read_csv(file_name, sep=" ", names=columns, dtype=dtypes).assign(runnr=runnr)
    table = pa.Table.from_pandas(df)
    # for the first chunk of records
    if pqwriter is None:
        # create a parquet write object giving it an output file
        pqwriter = pq.ParquetWriter(output[0], table.schema)
    pqwriter.write_table(table)

if pqwriter is not None:
    pqwriter.close()

Note that I read only one CSV file at a time here and then write it to the parquet file, applying the matching categorical data types already while reading. This makes the whole process very memory efficient. For very large CSV files, this could even be done incrementally by using iterator=True and chunksize. Or, to speed things up even further, I could have done the CSV to binary (e.g., parquet or feather) conversion in per-file jobs and then combined them in a final step – but for now this was fast enough for me.
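
For completeness, a chunked variant could look like this (a sketch only; the file names are placeholders, and in practice the dtype mapping from above should be passed as well so all chunks end up with the same schema):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

columns = ["vecid", "module", "signal", "event", "time", "value"]

pqwriter = None
# stream one very large CSV into a parquet file, one million rows at a time
for chunk in pd.read_csv("results/huge.vec.csv", sep=" ", names=columns, chunksize=1_000_000):
    table = pa.Table.from_pandas(chunk)
    if pqwriter is None:
        pqwriter = pq.ParquetWriter("results/huge.parquet", table.schema)
    pqwriter.write_table(table)

if pqwriter is not None:
    pqwriter.close()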

The code you see above is just part of a run: section of a Snakemake rule; Snakemake provides the input files. This also has the benefit that I can generate a complete result file, potentially including the simulations themselves, by just running snakemake with that rule.
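
The surrounding rule looks roughly like this (a simplified sketch; the rule name, file patterns, and run count are invented):

rule collect_results:
    input:
        modules_file="results/modules.txt",
        signals_file="results/signals.txt",
        result_files=expand("results/run_{n}.vec.csv", n=range(100)),
    output:
        "results/all.parquet",
    run:
        # the categorical setup and parquet writing code from above goes here
        pass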

Reading

Reading the data in the notebook then is a simple:

import pandas as pd

df = pd.read_parquet("results/all.parquet")

And all categories are in, while memory footprint stays low.
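
Continuing from the snippet above, a quick sanity check that the categoricals survived the round trip:

# module and signal come back as category dtypes
print(df.dtypes)

# deep=True also counts the string/category payloads
print(df.memory_usage(deep=True).sum())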