Converting OMNeT++ vector output files to long-form CSV files quickly using shell tools

OMNeT++, and thus also Veins, writes its recorded simulation results into custom .vec and .sca text files.1 Converting these files into a more common format has been an ongoing issue. OMNeT++ ships its own scavetool, whose usage and output formats have changed considerably across recent versions. My research group has its own set of Perl scripts for conversion, which need to read the whole vector file into RAM before writing it back out, and thus fail for large output files.2 What I missed was a simple, fast, and memory-efficient way to convert these vector files into a single CSV file without losing any data. So I started to roll my own. But instead of writing a fully fledged program, I tried to work with UNIX shell tools and pipelines. This turned out to be a splendid idea.

The .vec File Format

But before we get to the solution, let’s look at the format of the .vec file first3. The .vec file is composed of three types of sections, first appearing in this order:

  • the header section, containing the version number of the file format (version 2), some attributes of the simulation (attr), and parameters (param) as well as iteration variables (itervar) set in the config file.
  • the vector declaration section, containing a line for each vector that describes the connection between the vector id, the simulation entity, and the property that vector records (as well as the string “ETV”, which simply describes the format of the vector data section: Event number, simulation Time, Value)
  • the vector data section, containing the actual data recordings, one per line, composed of the following:
    • the vector id
    • the number of the event in which the data was recorded (what that means is not important here)
    • the simulation time at which the data was recorded (in seconds)
    • the value (of the property described by this vector)

In contrast to the header, there can be more than one vector declaration section and more than one vector data section. As memory is limited, result data has to be written from memory to disk from time to time, and not all vectors might be known at that point.
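As a rough sketch (header lines omitted, and the posy vector with its values entirely hypothetical), the sections of a .vec file might thus interleave like this:

vector 0 RSUExampleScenario.node[0].veinsmobility posx ETV
0	11	1	2414.9014220153
0	40	2	2414.9014220153
vector 1 RSUExampleScenario.node[0].veinsmobility posy ETV
0	69	3	2412.0282535023
1	69	3	1020.1234567890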

As a concrete example, taken from the Veins example scenario, the line

vector 0 RSUExampleScenario.node[0].veinsmobility posx ETV

contains the declaration of vector 0, which tracks the x-coordinate of vehicle 0. And the lines

0       11      1       2414.9014220153
0       40      2       2414.9014220153
0       69      3       2412.0282535023
0       98      4       2407.4342728612

contain some data for that vector, namely at events 11, 40, 69, and 98, or simulation seconds 1, 2, 3, and 4. Notice how the vehicle slowly starts to move.

The Desired CSV Format

The format we want to have our results in is a long-format CSV4 file. Using the long (tidy data) format allows us to keep every line concise and readable by itself.

Each line should contain the following fields, separated by a single tab character:

  1. The vector id (or vector number)
  2. The name of the simulation entity that recorded the value
  3. The name of the property whose value got recorded
  4. The event number at which the value got recorded
  5. The simulation time (in seconds) at which the value got recorded
  6. The actual recorded value

There should not be any quotes in the resulting CSV file. For the example data shown above, the output thus looks like this:

0	RSUExampleScenario.node[0].veinsmobility	posx	11	1	2414.9014220153
0	RSUExampleScenario.node[0].veinsmobility	posx	40	2	2414.9014220153
0	RSUExampleScenario.node[0].veinsmobility	posx	69	3	2412.0282535023
0	RSUExampleScenario.node[0].veinsmobility	posx	98	4	2407.4342728612

Extracting Vector Declarations and Vector Data

Before we can build one concise CSV file from a .vec file, we first have to split it up according to the three sections mentioned before. The header section can be ignored for now, as it will not be represented in the CSV. The vector declarations and vector data need to be split apart and processed separately. In the end, these two have to be recombined (like in a SQL JOIN) to form the final CSV.

To perform the extraction, we use simple grep calls. Vector declarations always start with the word vector and thus can be extracted easily. They might already be sorted in the .vec file, but to be sure, we sort them by vector id. Finally, we have to replace the spacing between the fields with tabs and remove the unneeded columns (the vector keyword and the trailing ETV string). This is easily done with sed. (There might be more efficient options, but the number of vector declarations is typically small compared to the amount of vector data, so this will most likely not be a bottleneck.)

We can implement all of this as a simple stream filter function:

function extract_vector_definitions() {
    # keep only declaration lines, sort them numerically by vector id (field 2),
    # then reduce each line to: id<TAB>module<TAB>property (dropping "vector" and "ETV")
    grep '^vector' | sort -k 2n | sed 's/vector\s\+\([0-9]\+\)\s\+\([^ \t]\+\)\s\+\([^ \t]\+\)\s\+[ETV]*/\1\t\2\t\3/'
}
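Fed the example declaration line from above, this function emits:

0	RSUExampleScenario.node[0].veinsmobility	posx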

The vector data needs to be extracted from the .vec file as well. We do this by using grep to remove all lines that start with header keywords or are empty. Then we sort the remaining data lines, first by vector id, then by event number.

This can also be implemented in a stream filter function:

function extract_data_records() {
    # drop declaration and header lines as well as empty lines,
    # then sort numerically by vector id (field 1) and event number (field 2)
    grep -v '^vector\|attr\|version\|param\|run\|itervar' | grep -v '^$' | sort -k 1n,1 -k 2n,2 --buffer-size=5%
}

This is usually by far the bigger part of the .vec file. Sorting potentially thousands or even millions of lines can be accelerated by giving sort more memory to work with. Adjust this to your own needs if you run into problems here.
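For instance, GNU sort’s --buffer-size option accepts absolute sizes as well as percentages of physical memory, so a more generous variant of the final pipeline step could read (adjust the value to your machine):

sort -k 1n,1 -k 2n,2 --buffer-size=2G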

The resulting output streams of the two functions could then simply be combined using join -j1. If reading the .vec file twice and working with raw bash functions is okay for you, you can probably stop here. I wanted to go a step further and do all the work in one pass.
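A minimal sketch of that two-pass variant (assuming the input file is named results.vec; the -t $'\t' keeps the output tab-separated, as join would otherwise separate fields with single spaces):

join -t $'\t' -j1 <(extract_vector_definitions < results.vec) <(extract_data_records < results.vec) > results.csv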

Single-stream Processing

Both extract_ functions need to read the whole .vec file. This is fine if you have a file on disk, but it rules out using them as a filter in a pipeline: a stream on stdin can only be read once, so we first need to duplicate the stream. I used a temporary named pipe and tee for this. Then we run extract_vector_definitions and extract_data_records on the two streams and finally join the results.

Aside from that, I wanted to support both piping the .vec file into my conversion script and providing a file name as an argument.

The result looks like this:

FNAME="$1"
if [[ -z "$1" ]]; then
    FNAME="-"
fi

TMPDIR=$(mktemp -d)
trap "rm -r $TMPDIR" EXIT
mkfifo $TMPDIR/vec.fifo

cat $FNAME | tee $TMPDIR/vec.fifo | extract_data_records | join -j1 <(extract_vector_definitions < $TMPDIR/vec.fifo) -
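Assuming the script is saved as vec2csv.sh (the name here is hypothetical), both ways of invoking it then work:

./vec2csv.sh results/run0.vec > run0.csv
cat results/run0.vec | ./vec2csv.sh > run0.csv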

The complete script is available in the veins_scripts repository of the Veins organization on GitHub.


  1. recent versions alternatively support SQLite output; see https://github.com/mightyCelu/oppsql for a project to handle those. ↩︎

  2. large meaning single-digit GiB files on a machine with 16 GiB of RAM. ↩︎

  3. .sca files are typically much smaller and thus are not covered here (yet). ↩︎

  4. it is actually tab-separated, but that is just a detail. ↩︎