Thursday, February 2, 2017

Java 8 Streams - A Deeper Look at Performance Improvement

Introduction

Java 8 was released almost three years ago, but it still lacks articles that take a deeper approach to the Stream API. There are some good articles about it, but not a single one shows a real-world example and compares its performance against Java 7-style code. This article assumes that the reader already has some knowledge of the Stream API, so simple code will not be explained here. To start learning about the Stream API, I suggest this article by Benjamin Winterberg.

This article shows a real-world-like example, with several different coding approaches, always comparing performance against Java 7. It is the result of a study of the Stream API's performance when dealing with file processing. The main goal is to use as many Streams and lambdas as possible in the Java 8 code, and not a single line of Java 8's new features in the Java 7 code, in order to learn how to use them. This project was my first contact with Java 8 at all, and I will show how the application evolved to its current state. Be ready to see some bad code as well!

The text is divided into sections and pretty much follows the timeline of the project's development. All benchmark results are posted at the end; the earlier sections describe how I implemented and improved the project. Please note that the article is long, so be patient and reserve some time to learn where I made mistakes, and maybe to help me find any errors that remain in the code.

Input Data

As input data for processing, I used a Brazilian government data file, known as Sigtap (it can be found here). It is a zip archive containing many text files, divided into two groups: layout files and data files.

Data files contain position-based information, where each line represents a database record. Layout files contain information about each position in the data files. These files look very much like a database export, which is how we'll treat them in this article.

It is important to note that there is a layout.txt file containing all other layout files' information. Let's call this file the general layout. It'll be used later in this article.

There is example data in the resources folder inside the project. Please take a look at those files to better understand the implementation decisions behind the code.

The Project

The objective is to convert the input data files into SQL inserts. The inserts are not printed or saved anywhere, just generated in memory, because of the volume of data. Only default Java libraries are used to process everything, except for the benchmarks themselves; for those, the JMH library was used.

Because of JMH, the project's jar expects a JVM argument called path to run properly. This argument is the path to the Sigtap file's extraction folder, for example: '/path/to/Stream_Study/src/main/resources/Sigtap/'.


Development Phase - Java 8

In this section, I'll present how the code was written, using some subsections to make reading easier. Although it's not strictly necessary, it's recommended to read the next subsections alongside this source file for reference.

Reading files

The project started with the Java 8 implementation. The first surprise was the newer Files API. One line of code to read all of a file's lines is impressive! Check out the code in Image 1.

Image 1 - Java 8's default file reading implementation.

Please note that Files.readAllLines accepts the file's encoding as a second argument. It's important to inform it in this case because the input data is encoded in ISO 8859-1. Also note that exceptions are ignored for example purposes; in this case, an empty List is returned to prevent a NullPointerException from happening.
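Since Image 1 is a screenshot, here is a minimal sketch of what that reading method likely looks like; the class and method names are assumptions, not the project's actual code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

public class FileReader8 {
    // Reads every line of a file in one call; returns an empty List on
    // failure so callers never have to deal with null.
    static List<String> readFile(String path) {
        try {
            return Files.readAllLines(Paths.get(path), StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        // A nonexistent path simply yields an empty list.
        System.out.println(readFile("/no/such/file.txt").size()); // prints 0
    }
}
```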

Processing files

With this very nice start, let's begin the file processing. The strategy is to read the general layout file to detect all tables and their columns. So, let's use a Map, where the key is the table's name and the value is the table's columns, as a List of Strings. This strategy provides an easy way to work with the layout information and to access the desired data from the data files.

First nightmare: how to perform this conversion using the Stream API? None of the default options could do the job, so let's Google it a bit. After Googling, I came to this StackOverflow question. Its accepted answer does the job, but hey, what alien code! At this point, I was not ready to understand such complex code. I had no problem seeing that it was the implementation of a custom Collector, but the Collector code itself was another story. Note that at this point, I introduced a lot of waste into the code.

Image 2 - First version of splitBySeparator method's code...

The strategy is to split the original file's list into sublists (Image 2). The sep argument is String::isEmpty, because I want to split the list whenever I find an empty line. The custom Collector receives three arguments: a Supplier, an Accumulator and a Combiner. The object created by the Supplier is what will be returned after a call to Stream's collect using the custom Collector. The Accumulator function (actually a BiConsumer) is called for each element of the stream and builds up the output. The Combiner function (a BinaryOperator) is only called when this Collector is used with a parallel stream, to combine multiple partial results from the Accumulator into a single result; for a sequential Stream, it is never called. The code above is not parallel-ready, but in this case it makes no difference, because the list must be processed sequentially or we won't get the expected result.
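A self-contained reconstruction of that splitBySeparator Collector may help; this is a sketch in the spirit of the StackOverflow answer, not the project's exact code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collector;

public class SplitCollector {
    // A custom Collector that splits a stream of elements into sublists,
    // cutting whenever the predicate matches (the matching element is
    // dropped). Sequential use only: the combiner below is not a correct
    // parallel merge.
    static <T> Collector<T, List<List<T>>, List<List<T>>> splitBySeparator(Predicate<T> sep) {
        return Collector.of(
                () -> {                          // Supplier: start with one empty sublist
                    List<List<T>> list = new ArrayList<>();
                    list.add(new ArrayList<>());
                    return list;
                },
                (acc, t) -> {                    // Accumulator: open a new sublist on separator
                    if (sep.test(t)) {
                        acc.add(new ArrayList<>());
                    } else {
                        acc.get(acc.size() - 1).add(t);
                    }
                },
                (acc1, acc2) -> {                // Combiner: only invoked by parallel streams
                    acc1.addAll(acc2);
                    return acc1;
                });
    }

    public static void main(String[] args) {
        List<List<String>> parts = Arrays.asList("a", "b", "", "c").stream()
                .collect(splitBySeparator(String::isEmpty));
        System.out.println(parts); // prints [[a, b], [c]]
    }
}
```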

Image 3 - ... and how to convert it into a Map

After splitting the list into a List<List<T>>, it is time to convert it into a Map<T, List<T>>, as desired. Let's take a look at Image 3's code. The list argument is the original file's List of lines and the sep argument is String::isEmpty, because I want to split the list at empty lines. After collecting with the splitBySeparator Collector above, a new Stream is started, filtering out empty lists (there are some at this point) and collecting to a Map using the default Collector implementation that generates a Map.

Note that this code is generic enough to work with any List and any Predicate. Just note that the first element of each sublist is used as the Map's key, so some adjustment may be needed for other inputs. Keep this comment in mind for the Java 7 implementation.
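As a hedged sketch of the Image 3 idea: the splitting itself is the custom Collector from Image 2, so a plain loop stands in for it here to keep the example self-contained; the interesting part is the filter + toMap pipeline at the end. Names are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class ListToMap8 {
    // A plain loop stands in for the Image 2 Collector; the Stream part
    // filters empty sublists and maps each block's first line to the rest.
    static <T> Map<T, List<T>> listToMap(List<T> list, Predicate<T> sep) {
        List<List<T>> split = new ArrayList<>();
        split.add(new ArrayList<>());
        for (T t : list) {
            if (sep.test(t)) split.add(new ArrayList<>());
            else split.get(split.size() - 1).add(t);
        }
        return split.stream()
                .filter(sub -> !sub.isEmpty())                // drop the empty sublists
                .collect(Collectors.toMap(
                        sub -> sub.get(0),                    // first line is the key
                        sub -> sub.subList(1, sub.size())));  // the rest are the values
    }

    public static void main(String[] args) {
        Map<String, List<String>> map = listToMap(
                Arrays.asList("TABLE_A", "col1", "col2", "", "TABLE_B", "col3"),
                String::isEmpty);
        System.out.println(map.get("TABLE_A")); // prints [col1, col2]
    }
}
```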

Validating

Although this is an example project, it mimics real-world applications, so it needs some kind of data validation. In this case I chose to simply check whether the general layout and each table's specific layout have the same content. The simplest way to do so is to convert each List of columns into a single String, taking care to use the same sorting for both files.

Image 4 - A simple validation step.

The Map.Entry in the code above is an entry from the general layout's Map. The key is the table's name and the value is the List of that table's columns (with extra column-processing information). This snippet just starts a Stream from each List of columns, sorts it and collects it into a String using the default joining Collector. It then throws an Exception if the two are not the same. Please note the call to the readFile method: it guarantees that the argument to equals comes from the other file.
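The core of the validation can be sketched as follows; method and exception names are assumptions, simplified to two plain Lists.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class Validate8 {
    // Sort both column lists the same way, join them into single Strings,
    // and compare; different content means invalid input.
    static void validate(List<String> generalLayout, List<String> tableLayout) {
        String expected = generalLayout.stream().sorted().collect(Collectors.joining(";"));
        String actual = tableLayout.stream().sorted().collect(Collectors.joining(";"));
        if (!expected.equals(actual)) {
            throw new IllegalStateException("Layouts do not match");
        }
    }

    public static void main(String[] args) {
        // Same columns in a different order still validate.
        validate(Arrays.asList("NO", "CO", "DT"), Arrays.asList("CO", "DT", "NO"));
        System.out.println("valid"); // prints valid
    }
}
```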

Generating SQL inserts

The next step is to generate the desired SQL inserts. This is the core of the processing in this project. At this point the data files are processed using the positional information contained in the layout files.

Image 5 - Processing the data files.

The Map.Entry in Image 5's code is an entry from the general layout's Map. The key is the table name and the value is the List of that table's columns (with extra column-processing information). The first thing to do is validate the entry using Image 4's validate method. Once it's validated, it's important to remove some header information from the file, to prevent misbehaviour of the application.

Some extra information about the layout now becomes important: each layout line can be split on a comma, here conveniently replaced by the SEPARATOR constant. After splitting a layout line, only three positions are interesting for this project: index 0, containing the column name; index 2, which gives the position where reading starts in the data file; and index 3, which gives the position where reading stops.

Next, it's time to get a list of each column's name. Starting a Stream from the layout List, I mapped each split layout line to its index 0, sorted the result and collected it using the default toList Collector.

After extracting the column information, it's time to iterate over the data file's content. Note that this is the first parallelStream used in the entire code so far. I used it here because this is the first point where parallelism does no harm, and benchmarks showed parallelStream is faster at this point. That is a major tip: test your own case to see if parallelStream is the best option. I used a forEach loop here to populate a List of Maps representing each line of the file. The Map is a key-value representation of the data, where the key is the column's name and the value is that column's content in that line of the file. Since each file line holds many columns, I used a nested forEach, this time over the layout's content, to convert each line into the Map I wanted.

At this point, the processedDataList variable is a List where each data file line has been converted into a map from columns to their data. Now I used another parallelStream over processedDataList to actually generate the inserts. After removing null elements, each line is mapped to its SQL insert text and then collected into a List, using the default toList Collector.

Image 6 - Generating the SQL insert.

To produce the SQL insert, I used the code shown in Image 6. There is a StringBuffer to build the final String and two Stream operations inside the method. The first one I already explained. The second one is the most interesting: starting a Stream from the list of columns, each column is mapped to its data value. Then Java 8's Optional handles null elements, returning SQL's NULL keyword as a String when there is no data for that column in the Map. It is also important to map each empty value to the NULL String. Finally, the default joining Collector creates the String.
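The Optional-based mapping can be sketched like this; a StringBuilder stands in for the article's StringBuffer, and table/column names, quoting and method names are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

public class InsertGen8 {
    // Columns drive both the insert's column list and its value list;
    // Optional turns missing data into SQL NULL, and a filter also turns
    // empty values into NULL.
    static String generateInsert(String table, List<String> columns, Map<String, String> line) {
        StringBuilder sb = new StringBuilder("INSERT INTO ").append(table).append(" (");
        sb.append(columns.stream().collect(Collectors.joining(", ")));
        sb.append(") VALUES (");
        sb.append(columns.stream()
                .map(col -> Optional.ofNullable(line.get(col))
                        .filter(v -> !v.isEmpty())   // empty values also become NULL
                        .map(v -> "'" + v + "'")
                        .orElse("NULL"))
                .collect(Collectors.joining(", ")));
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        Map<String, String> line = new HashMap<>();
        line.put("CO", "101");
        line.put("NO", "");
        System.out.println(generateInsert("TB_X", Arrays.asList("CO", "NO"), line));
        // prints INSERT INTO TB_X (CO, NO) VALUES ('101', NULL)
    }
}
```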

Orchestrating everything

Now that I've shown how to process the entire file, it's time to understand how to orchestrate all these method calls.

Image 7 - Orchestrating all methods.

Image 7's code shows how to orchestrate everything. This is the entry point for all the processing, and it is very simple. First I use readFile to read the general layout file. Then I use listToMap to transform the List of the file's lines into a Map where the key is the table's name and the value is the List of the table's columns.

The next step is to get the Map's entry set, from which I can get a Stream, to start processing. After getting the Stream from the entry set, it is important to remove null elements. Then each Map.Entry is converted to the process method's output using flatMap. Flat mapping is an extremely important Stream concept: it allows you to convert one input element into multiple output elements transparently. In this case, each entry of the Map, representing a table, is converted into a List of Strings representing all the SQL inserts to run on that table. Finally, it's time to produce the algorithm's output, a List of Strings, using the default toList Collector.
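A skeleton of that entry point could look like this; process here is a trivial stand-in for Image 5's real method (which returns a List in V1, hence the .stream() inside flatMap), and all names are assumptions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

public class Orchestrate8 {
    // Stand-in for Image 5's process method: one table entry yields many inserts.
    static List<String> process(Map.Entry<String, List<String>> entry) {
        return entry.getValue().stream()
                .map(line -> "INSERT for " + entry.getKey() + ": " + line)
                .collect(Collectors.toList());
    }

    // Skeleton of the entry point: flatMap turns each Map entry (one table)
    // into the many inserts generated for it.
    static List<String> execute(Map<String, List<String>> layout) {
        return layout.entrySet().stream()
                .filter(Objects::nonNull)
                .flatMap(entry -> process(entry).stream()) // one entry in, many inserts out
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> out = execute(Collections.singletonMap("TB_X", Arrays.asList("a", "b")));
        System.out.println(out.size()); // prints 2
    }
}
```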

Comments over this implementation

The code presented in this section is the result of my first contact with Java 8. It contains lots of problems I'll discuss in the section called Improvement Phase - Java 8. At this point, a good exercise is to understand this code and try to find performance problems, to compare with the results shown later. That way you may produce an even better optimized implementation than mine.

Development Phase - Java 7

After the first implementation using Java 8, it's time to talk about the first implementation using Java 7. The first version of the Java 7 code was basically a translation of the Java 8 implementation, to test performance on a consistent basis. Although it's not strictly necessary, it's recommended to read the next subsections alongside this source file for reference.

Reading files

In this version, I used the most common implementation for reading files: Java 6 style. Please note the amount of extra code compared to the Java 8 version.

Image 8 - Java 6's style to read files.
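For reference, the classic pre-NIO reading pattern likely looks something like this sketch; names are assumptions, and the ceremony is the point.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FileReader6 {
    // The classic pattern: open, loop, close in a finally block.
    // Compare the ceremony here with the single-line Java 8 version.
    static List<String> readFile(String path) {
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(new InputStreamReader(
                    new FileInputStream(path), "ISO-8859-1"));
            List<String> lines = new ArrayList<String>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        } catch (IOException e) {
            return Collections.emptyList();
        } finally {
            if (reader != null) {
                try { reader.close(); } catch (IOException ignored) { }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(readFile("/no/such/file.txt").size()); // prints 0
    }
}
```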

Processing files

The same strategy as the Java 8 version was used at this point: create a Map where the key is the table's name and the value is a List of the table's columns. There was no problem writing the Java 7 version, so no Googling was needed. Please note that this method is much more readable. Also, there is no need for a two-method solution; one alone does the job.

Image 9 - Converting List to Map. Much easier to understand!

Reading the code, the first loop does the same thing the custom Collector does in the Java 8 version, and the second loop generates the output Map, just like Java 8's default toMap Collector. Note that this implementation also has problems with empty Lists.

Note that this code is not completely generic; only Strings can be processed without modification. I chose not to use Java 8's Predicate functional interface here, to stay strictly within Java 7's default code style, exactly to show where Java 8 is better or worse in terms of readability and usability.
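The two-loop structure described above can be sketched as follows; this is a reconstruction with assumed names, not the code from Image 9.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ListToMap7 {
    // Two explicit loops: the first splits at empty lines (what the custom
    // Collector did in Java 8), the second builds the Map (what toMap did).
    static Map<String, List<String>> listToMap(List<String> list) {
        List<List<String>> split = new ArrayList<List<String>>();
        split.add(new ArrayList<String>());
        for (String line : list) {
            if (line.isEmpty()) {
                split.add(new ArrayList<String>());
            } else {
                split.get(split.size() - 1).add(line);
            }
        }
        Map<String, List<String>> map = new HashMap<String, List<String>>();
        for (List<String> sub : split) {
            if (!sub.isEmpty()) {
                map.put(sub.get(0), sub.subList(1, sub.size()));
            }
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, List<String>> map = listToMap(
                Arrays.asList("TABLE_A", "col1", "col2", "", "TABLE_B", "col3"));
        System.out.println(map.get("TABLE_B")); // prints [col3]
    }
}
```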

Validating

Following the Java 8 strategy again, let's validate the input's content by comparing the Strings generated from the column Lists of both the general layout and each table's layout file.

Image 10 - Validation by converting List into String.

Compared to Java 8's code in Image 4, there is plenty more code here. It's easy to understand, but the Stream implementation is also easy to read. Here I use two for loops to convert each List into a String and then check whether both are equal. An inconvenience is the need to remove the last SEPARATOR from the output String, done through substring calls. I chose to remove the last SEPARATOR character to keep the compared objects exactly the same as those generated by the Stream solution, for consistency.
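The loop-plus-substring idea can be sketched like this (a reconstruction with assumed names, which also assumes non-empty column lists):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class Validate7 {
    // Build both Strings by hand, then trim the trailing separator so the
    // result matches the joining Collector's output from the Java 8 version.
    static void validate(List<String> generalLayout, List<String> tableLayout) {
        List<String> g = new ArrayList<String>(generalLayout);
        List<String> t = new ArrayList<String>(tableLayout);
        Collections.sort(g);
        Collections.sort(t);
        StringBuilder expected = new StringBuilder();
        for (String column : g) {
            expected.append(column).append(";");
        }
        StringBuilder actual = new StringBuilder();
        for (String column : t) {
            actual.append(column).append(";");
        }
        // substring drops the trailing ";" appended by the last iteration
        if (!expected.substring(0, expected.length() - 1)
                .equals(actual.substring(0, actual.length() - 1))) {
            throw new IllegalStateException("Layouts do not match");
        }
    }

    public static void main(String[] args) {
        validate(Arrays.asList("NO", "CO"), Arrays.asList("CO", "NO"));
        System.out.println("valid"); // prints valid
    }
}
```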

Generating SQL inserts

As in the Java 8 version, here is the core of the Java 7 implementation. Read Image 11's code carefully.

Image 11 - Java 7's style for processing data files.

This is the most similar code snippet between the versions. Almost every line does the same thing as in the Java 8 version. The main difference is that I don't use parallelism here, just sequential processing.

Image 12 - Generating the SQL insert.

Generating the SQL insert with Java 7 takes a lot more code, but it is easy to read. The same approach is used: a StringBuilder to construct the output String, and two loops, one for the column list and one for the data list. The same trailing-SEPARATOR removal problem appears here. It is really annoying, but it does the job.

Orchestrating Everything

Again, let's talk about how to orchestrate everything shown so far in the Java 7 version.

Image 13 - Orchestrating all methods.

Again, very similar code between versions. In fact, the only differences are the specialized listToMap implementation (without a Predicate argument) and a loop instead of Stream's flatMap. This method is also the entry point for all processing. The Java 8 version uses less code to do the same work, but this code is easier to understand.

Comments over this implementation

This section's code is just a translation of the Java 8 version. In fact, any method could be swapped between the two implementations. The only real difference in the methods' interfaces is the use of Predicate as a separator function in Java 8's listToMap, something Java 7 can't provide and which can be very useful when reusing code. It is also clear that Java 8 can do the same work with a lot less code than Java 7.

At this point, there is a lot of waste code in both implementations. As a spoiler, this Java 7 implementation is faster than Java 8's. More about that in the Benchmark Results section, after discussing improvements to each version. Again, I recommend understanding all the code shown in this section and trying to remove waste code and improve performance.

Improvement Phase - Java 8

In this section, the main goal is to show how the code evolved after studying more and rethinking the problem. Here I'll talk only about the Java 8 implementation's improvements; the next section covers the Java 7 version. Again, the sequence of events is how I really developed the project.

I developed two major evolutions using Java 8, which I'll call V2 and V3 to simplify writing and reading. I'll refer to the first version as V1 when needed. The same subsections as in the sections above will be used to show how each part evolved. Skipped subsections have not changed since the previous version.

Java 8's V2

This version removes lots of waste code and improves readability. Although it's not strictly necessary, it's recommended to read the next subsections alongside this source file for reference.

Processing Files

In this subsection, I tried lots of different implementations for converting a List into a Map. I'll focus on the chosen implementation, but this file contains other alternatives. All alternatives will be discussed in the Benchmark Results section.

I replaced the two-step custom Collector approach in favor of a reduce-based implementation. This approach also came from this StackOverflow question. It provides less, simpler code that is easier to understand and performs better.

Image 14 - Second version of listToMap using Java 8.

Note that the reduce arguments are very similar to the custom Collector's arguments. Some improvements were made to simplify the code, but it is pretty much the same code as in Image 2. Note that this listToMap method will not behave properly if called on a parallelStream, because the input List must be processed sequentially.
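A reduce-based sketch in that spirit might look like this; as with the article's version, it is sequential-only, because the accumulator mutates its argument (which a parallel stream would corrupt). Names are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class ListToMapV2 {
    // Split via reduce, then filter + toMap as before. The identity,
    // accumulator and combiner mirror the custom Collector's three arguments.
    static <T> Map<T, List<T>> listToMap(List<T> list, Predicate<T> sep) {
        List<List<T>> split = list.stream().reduce(
                new ArrayList<List<T>>(),
                (acc, t) -> {
                    if (acc.isEmpty() || sep.test(t)) {
                        acc.add(new ArrayList<T>());   // open a new sublist
                    }
                    if (!sep.test(t)) {
                        acc.get(acc.size() - 1).add(t);
                    }
                    return acc;
                },
                (a, b) -> { a.addAll(b); return a; }); // combiner, unused sequentially
        return split.stream()
                .filter(sub -> !sub.isEmpty())
                .collect(Collectors.toMap(sub -> sub.get(0),
                        sub -> sub.subList(1, sub.size())));
    }

    public static void main(String[] args) {
        Map<String, List<String>> map = listToMap(
                Arrays.asList("TABLE_A", "col1", "col2", "", "TABLE_B", "col3"),
                String::isEmpty);
        System.out.println(map.get("TABLE_A")); // prints [col1, col2]
    }
}
```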

Unfortunately, this code is still quite complex to understand and may cause maintenance problems if used in a real-world project.

Validating

No big changes here; I just created two variables to improve readability. It's now much easier to understand than Image 4's code.

Image 15 - New validate method. Easier to understand.

Generating SQL inserts

Lots of improvements here. First, all the synchronized stuff has been removed to simplify the code and increase readability. Then, one less Stream processing step is needed, merging the last two parallelStreams into a single one. That means one less iteration over the results and more performance!

Image 16 - Clearer way to generate inserts. Compare it with the Image 5.

One important change is the use of a Supplier variable. It is very useful when a Stream needs to be reused. Java's Streams are not reusable: once a terminal operation has run, the Stream can't be used again. But we can still assign a Stream definition to a variable, in order to reuse code, through the Supplier functional interface. Look at the accesses to the columnSupplier variable: its single get method returns a new instance of the desired Stream each time it's called. It helps the readability of the code.
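The Supplier trick can be shown in isolation; this is a minimal, self-contained example, not the project's actual columnSupplier.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamSupplierDemo {
    // A Stream dies after its terminal operation, but a Supplier hands out a
    // fresh Stream on every get() call, so the pipeline definition is written
    // once and reused safely.
    static Supplier<Stream<String>> columnSupplier(List<String> columns) {
        return () -> columns.stream().sorted();
    }

    public static void main(String[] args) {
        Supplier<Stream<String>> supplier =
                columnSupplier(Arrays.asList("CO", "NO", "DT"));
        String joined = supplier.get().collect(Collectors.joining(", "));
        long count = supplier.get().count(); // a second terminal operation, no error
        System.out.println(joined + " / " + count); // prints CO, DT, NO / 3
    }
}
```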

Also note that this method now returns a Stream instead of a List. Java's Streams are only processed when a terminal operation is called, such as forEach, collect or reduce. The fewer terminal operations needed, the better, because that way Java can optimize the code's execution. More about how this can optimize execution time in the next section. Please go back and take a look at Image 7's code: flatMap needs the lambda's output to be a Stream. V1's collect approach was quite wrong, since the process method's output List was only used to be turned into a Stream again.

Image 17 - Easier way to generate the SQL insert.

Generating the insert text was also improved; compare it with Image 6's code. Now I used a different joining Collector overload to provide a prefix and a suffix for the output String. Note that the baseInsertText variable's value does not change for each processed data line: it is now generated once per table and reused for each of the data file's lines. Less code to process, more performance.
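The prefix/suffix overload of joining can be illustrated as follows; the table and column names are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class BaseInsertDemo {
    // joining's three-argument overload adds a prefix and a suffix, so the
    // fixed part of the insert is built once per table and reused for every
    // data line.
    static String baseInsertText(String table, List<String> columns) {
        return columns.stream()
                .collect(Collectors.joining(", ",
                        "INSERT INTO " + table + " (", ") VALUES ("));
    }

    public static void main(String[] args) {
        String base = baseInsertText("TB_X", Arrays.asList("CO", "NO"));
        System.out.println(base + "'101', 'abc')");
        // prints INSERT INTO TB_X (CO, NO) VALUES ('101', 'abc')
    }
}
```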

Orchestrating everything

The only change is that there is no need to start a Stream inside flatMap, because the process method already returns a Stream.

Image 18 - Now I can use Java 8's method reference instead of lambda inside flatMap.

Comments over this implementation

After this improvement round, the code became more concise and easier to understand. Lots of wasteful operations were removed. Looking at the code, it doesn't seem to be such a huge improvement, but believe me, it is. This version made me start thinking about how to develop with the Stream API in mind, and helped me understand how this new world of Java programming works.

Java 8's V3

This version chains as many operations as possible onto the same Stream, in order to reduce the number of terminal operations. It's important to maximize the number of operations on the same Stream because it reduces the amount of looping performed over the data. As Streams are lazily executed, Java executes all the operations over the data inside the same loop when a terminal operation is called, performing all the needed transformations on each iteration. Note that hand-writing code with the same behaviour could easily become a mess, while with Streams it is not that hard to understand what's going on. Although it's not strictly necessary, it's recommended to read the next subsections alongside this source file for reference.

Reading files

This time, another default Java 8 approach was used: reading a file directly into a Stream. The file is only actually read when a terminal operation is called, so all the file's data is processed according to the Stream's operations before it is turned into an object.

Image 19 - Reading a file to a Stream.

Note that the code is as easy to write as the code for reading all of a file's lines. It is important to return an empty Stream instead of null, because it prevents NullPointerExceptions. Why this is better for my scenario will become clearer in the next subsections.
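A sketch of that lazy-reading method, with assumed names:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class FileReaderV3 {
    // Files.lines is lazy: the file's content is only read when a terminal
    // operation runs. Returning Stream.empty() on failure avoids
    // NullPointerExceptions. Note that the returned Stream keeps the file
    // open until it is closed, so leaking these streams exhausts file handles.
    static Stream<String> readFile(String path) {
        try {
            return Files.lines(Paths.get(path), StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            return Stream.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(readFile("/no/such/file.txt").count()); // prints 0
    }
}
```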

Processing Files

In order to use the longest Stream possible, the approach has changed here. I no longer want a Map; instead, a List of Lists is more appropriate. This change eliminates one collect operation, which was used to generate the Map.

Image 20 - New way to split data returning a Stream.

The differences between this version and V2's code in Image 14 are: the list argument is a Stream instead of a List; the method has a new name, as it no longer converts the List into a Map; and the return type has changed from Map to Stream. Note that the first terminal operation is used here, in the call to reduce. I haven't found a way to eliminate this operation. At this point, the input file has been read and processed, now turned into a List of Lists, and a new Stream is started.

Validating

Following these changes, the validation step was also modified. This time the validation approach is completely different, in order to reduce data manipulation a bit.

Image 21 - New strategy to validate data.

As there is no Map anymore, the method's arguments are now the tableName String and the layoutList List, replacing the Map.Entry argument. Note that the same information is passed to the method. The new approach is to compare two Lists instead of two Strings. Note that it is necessary to compare the lists through containsAll instead of equals, to avoid sorting each List. But using containsAll requires comparing the Lists both ways, or the result may not be correct: comparing just one List against the other will not detect when the first List has more elements than the second.
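The two-way containsAll check is worth seeing in isolation; a minimal sketch (the method name is an assumption):

```java
import java.util.Arrays;
import java.util.List;

public class ValidateV3 {
    // Comparing with containsAll avoids sorting, but it must run both ways:
    // each check alone only proves one list is a subset of the other.
    static boolean sameColumns(List<String> general, List<String> table) {
        return general.containsAll(table) && table.containsAll(general);
    }

    public static void main(String[] args) {
        System.out.println(sameColumns(
                Arrays.asList("CO", "NO", "DT"),
                Arrays.asList("DT", "CO", "NO"))); // prints true
        // A one-way check would miss this case: general has an extra column.
        System.out.println(sameColumns(
                Arrays.asList("CO", "NO", "DT"),
                Arrays.asList("CO", "NO")));       // prints false
    }
}
```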

Generating SQL inserts

There are some changes in the core processing as well. The argument of the process method has changed because there is no Map.Entry type anymore; it now receives a List instead of a Map.Entry.

Image 22 - Generating inserts using as many Streams as possible.

The main change is that the returned Stream is now actually the file's Stream with some mapping operations. This way, the data file is not actually read inside this method, because a Stream is returned. The generateInsert method was not modified.

Orchestrating everything

As all the other methods have changed, the execute method naturally has modifications too. It actually became simpler and more readable.

Image 23 - New way of orchestrating method calls.

Note that the splitList, readFile and process methods now return a Stream. This simplifies the implementation a bit. It is important to note that splitList actually starts a new Stream, because readFile's Stream is reduced to a List of Lists inside splitList. Also note that the process method's output Stream reads the data file lazily, which means each data file is only read when the final collect method is called.

Comments over this implementation

This implementation was made to really use the Stream API's full potential. It is extremely important to note that when I wrote the first Java 8 version of this code, I was not capable of developing something like this version, because I was not yet thinking in Java 8, only the old Java 7 way. I haven't found any other optimization to apply to this code, but certainly something could still be optimized. Anyway, I'm very happy with the performance results of this implementation.

Improvement Phase - Java 7

The Java 7 version of this project has also been improved, again based on Java 8's V2 implementation. But the second version of the Java 7 code is not just a translation of Java 8's second version; some other improvements were applied too. There is only one improved version based on Java 7, called V2. The first version will be referred to as V1.

Java 7's V2

This version brings all the shareable optimizations used in Java 8's V2, plus some exclusive changes. Although it's not strictly necessary, it's recommended to read the next subsections alongside this source file for reference.

Reading files

This time, I decided to use Java 7's default implementation for reading files, using try-with-resources. It shortened the code a lot, also making it more readable.

Image 24 - Java 7's style for reading files.

Compared to Image 8's code, there's a lot of improvement in readability here. Note that multiple resources in the try statement are not strictly necessary, but using only one implies constructing objects by passing a constructor as an argument, which reduces readability. The biggest advantage of try-with-resources is that resources initialized inside the parentheses do not need to be closed explicitly.
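The try-with-resources version might look like this sketch (names are assumptions); each resource declared in the parentheses is closed automatically, even when an exception is thrown.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FileReader7 {
    // Declaring each resource separately keeps the chain readable; all three
    // are closed automatically, in reverse order, when the try block exits.
    static List<String> readFile(String path) {
        try (FileInputStream fis = new FileInputStream(path);
             InputStreamReader isr = new InputStreamReader(fis, "ISO-8859-1");
             BufferedReader reader = new BufferedReader(isr)) {
            List<String> lines = new ArrayList<String>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        } catch (IOException e) {
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        System.out.println(readFile("/no/such/file.txt").size()); // prints 0
    }
}
```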

Processing files

This subsection contains a lot of improvement, both in performance and readability. A new strategy was used to convert a List into a Map; the new approach needs only one loop over the input List to generate the Map. There is an extra attempt at improving this part that won't be discussed in this topic, only in the Benchmark Results section. The code can be found here.

Image 25 - New algorithm to convert a List into a Map.

Note how it is possible to construct the desired Map efficiently using a simple flag. This solution also has no problems with empty Lists. This code is much more readable than Java 7's V1 implementation, and much, much more readable than the Java 8 versions.
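The flag idea can be sketched as follows; a reconstruction with assumed names, not the code from Image 25.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ListToMap7V2 {
    // Single pass: a flag marks "the next non-empty line is a table name",
    // so no intermediate List of Lists is needed.
    static Map<String, List<String>> listToMap(List<String> list) {
        Map<String, List<String>> map = new HashMap<String, List<String>>();
        List<String> current = null;
        boolean newTable = true;
        for (String line : list) {
            if (line.isEmpty()) {
                newTable = true;        // next non-empty line starts a new table
            } else if (newTable) {
                current = new ArrayList<String>();
                map.put(line, current); // first line of a block is the key
                newTable = false;
            } else {
                current.add(line);
            }
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, List<String>> map = listToMap(
                Arrays.asList("TABLE_A", "col1", "col2", "", "TABLE_B", "col3"));
        System.out.println(map.get("TABLE_A")); // prints [col1, col2]
    }
}
```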

Validating

No significant performance changes were made to the validate method, but readability was greatly improved by using a helper method that converts a List into a String using a default separator.

Image 26 - More readable validation method.

Compared to Image 10's code, the main change is just the use of a helper method for joining a List into a String. It's very clear how much cleaner the code became, and now there is some reusability.

Image 27 - New way to convert a List into a String allows code reusing.

The helper method itself is unchanged from the Image 10 code it replaces.

Generating SQL inserts

There are some improvements here already known from Java 8's V2, such as eliminating one loop over the data and processing the SQL insert's base text only once.

Image 28 - New way to processing data files.

Some waste code was removed, but the general structure of this method did not change. Compare it with Image 11's code; there are not many changes here.

Image 29 - Optimized way to generate SQL inserts.

Note that the generateInsert method has evolved a lot. Reusing the baseInsertText variable's content simplified this method very much. Compare it with Image 12's code: both readability and performance improved here.

Comments over this implementation

The major improvement in this version was the listToMap method. The other changes were just about removing unnecessary code, while listToMap is really new code. But even with few changes, all that waste code was significantly limiting how performant this implementation could be. I haven't found any new improvements to apply to this code at the moment, though some improvement may certainly still be possible. Anyway, I'm very happy with the performance results for this implementation.

Benchmark Results

Now that all the code has been properly explained, let's see how each version performs in a consistent test. As already mentioned, I used the JMH library to benchmark all the code. It makes test results consistent and provides serious metrics. All the benchmark code is available in this file and will not be explained here.

Image 30 - Benchmark results!

Image 30 contains the results of all benchmarks implemented in this project. They will all be properly explained in detail in the next subsections, properly split up for better readability. About the table's columns: the Benchmark column displays the benchmark's name; the Mode column displays which kind of test the result refers to (avgt means Average Time); the Cnt column displays how many runs were performed to compute the score; the Score column is the test's result and the most important column to compare; the Error column is the test's error margin; the Units column is the measurement unit of the Score and Error columns.

Please note that all benchmarks here used the Average Time mode. Also note that the benchmark name indicates which Java version is used and which version explained here the test refers to. Finally, note that benchmark results are machine dependent, so running them on a different machine will produce different results.

Read file results

Image 31 - Benchmark results from reading files.

Let's start with reading files. Both Java 7 implementations have exactly the same performance. Java 8's first and second implementations show practically the same results, a little slower than the Java 7 versions, and Java 8's V3 implementation is far faster than anything else! Why? Because readFile's implementation in Java 8's V3 just opens the file into a Stream, which is simply closed here, with no processing performed. In fact, this test only shows how long it takes to prepare a Stream, not to read the file itself.

Analyzing the results, the verdict is to use the Java 8 versions (V3 if a Stream is required, V1 if a List is required), unless performance is a very critical issue for your application and file reading operations are very common. Otherwise, Java 8's ease of writing and reading and the small difference between results are really convincing arguments.

Conversions from List to Map

Image 32 - Benchmark results from converting List into Map.

Please note that some benchmarks here do not refer to one specific version. Those tests are partially described back in the Improvement sections as partial improvement code. Note also that Java 8's V3 splitList method is not tested here, because it throws lots of "too many open files" exceptions while running the benchmark tests, even when collecting the Stream's result.

Looking at Java 7's results, the partial-study method using the edges approach, not shown in this article, looks faster, but its more complex code and its greater error margin were decisive in choosing the V2 implementation. Because of the error margin, results were different in past runs, with the V2 implementation faster.

Now looking at Java 8's results, the fastest implementation is definitely the one using the forEach approach. This version is basically Java 7's V2, replacing the traditional for loop with Java 8's forEach call and a lambda argument. This code works, but requires some inconvenient global variables, because a lambda can only read local variables that are effectively final. That is why I discarded this solution. As for the edges approach, the same argument used to reject Java 7's edges approach applies: the V2 implementation is as fast as the edges approach, but shows more stable results.

In this case, Java 7's code is faster than Java 8's implementation. Also, the second Java 7 version's code is easier to understand, so it's better to use the Java 7 style at this point. Although the Java 8 version has very close performance, it lacks readability.

Full processing

Image 33 - Benchmark results from all processing presented.

Surprise! Java 8's V3 optimized implementation is faster than Java 7's V2 optimized implementation?! Yes! But notice that Java 8's V3 has a much larger error margin than Java 7's V2. It means that in the long term Java 7's V2 may be faster, and it's even likely that a single Java 7 run would beat a single Java 8 run. I prefer to call the results a tie. But hey, that's awesome! It means that Java 8's Streams can be as fast as Java 7's traditional approach, and this is a great result!

Looking at the other results, it's impressive how much performance can be gained just by removing waste code. It also shows that Java 8 Streams need to be written correctly to achieve satisfactory performance.

Conclusion

This project shows a bit about Java 8 Stream optimization. It's clear that it is very easy to make a mistake when working with Streams, and it may slow down your application's performance in a concerning way. It just means that Stream processing must be analyzed very carefully before releasing the code.

It is important to note that a real-world application would use a hybrid approach, mixing Java 7 concepts with Java 8's new Stream processing, rather than the all-or-nothing solution shown here. This article is proof that Streams can be used alone, but Java's traditional style is still easier to use and usually has better performance.

Please use the comments box or GitHub issues/pull requests to discuss the results and suggest improvements. That way, we all learn together how to improve development with Java 8's new features.

Thank you for your time!