Thursday, February 2, 2017

Java 8 Streams - A Deeper Look at Performance Improvement

Introduction

Java 8 was released almost three years ago, but it still lacks articles that take a deeper approach to the Stream API. There are some good articles about it, but not a single one shows a real-world example and compares its performance against Java 7-style code. This article assumes that the reader already has some knowledge of the Stream API, so simple code will not be explained here. To start learning about the Stream API, I suggest this article by Benjamin Winterberg.

This article shows a real-world-like example, with several different coding approaches, always comparing performance against Java 7. It is the result of a study of the Stream API's performance when dealing with file processing. The main goal is to use as many Streams and lambdas as possible in the Java 8 code, and not a single line of Java 8's new features in the Java 7 code, in order to learn how to use them. This project was my first contact with Java 8 at all, and I will show how the application evolved to its current state. Be ready to see some bad code as well!

The text is divided into sections and pretty much follows the timeline of the project's development. All benchmark results are posted at the end; the earlier sections describe how I implemented and improved the project. Please note that the article is long, so be patient and reserve some time to learn where I made mistakes, and maybe to help me find any errors that remain in the code.

Input Data

As input data for processing, I used a Brazilian government data file, known as Sigtap (it can be found here). It is a zip archive containing many text files, divided into two groups: layout files and data files.

Data files contain position-based information, where each line represents a database record. Layout files contain information about each position in the data files. These files look very much like a database export, which is how we'll treat them in this article.

It is important to note that there is a layout.txt file containing all other layout files' information. Let's call this file the general layout. It'll be used later in this article.

There is example data in the resources folder inside the project. Please take a look at those files to better understand the implementation decisions behind the code.

The Project

The objective is to convert the input data files into SQL inserts. The inserts are not printed or saved anywhere, just generated in memory, because of the volume of data. Only default Java libraries are used to process everything, except for the benchmarks themselves; for those, the JMH library was used.

Because of JMH, the project's jar expects a JVM argument called path to run properly. This argument is the path to the Sigtap file's extraction folder, for example: '/path/to/Stream_Study/src/main/resources/Sigtap/'.


Development Phase - Java 8

In this section, I'll present how the code was written, using some subsections to make reading easier. Although it's not strictly necessary, it's recommended to read the next subsections alongside this source file for reference.

Reading files

The project started with the Java 8 implementation. The first surprise was the newer Files API. One line of code to read all of a file's lines is impressive! Check out the code in Image 1.

Image 1 - Java 8's default file reading implementation.

Please note that Files.readAllLines accepts the file's encoding as a second argument. It's important to inform it in this case because the input data is encoded in ISO 8859-1. Also note that exceptions are ignored for example purposes; in this case, an empty List is returned to prevent a NullPointerException from happening.
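Since Image 1 is a screenshot, here is a minimal sketch of what that reading method likely looks like; the class and method names are assumptions, not the project's actual code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

public class FileReader8 {
    // Reads every line of a file in one call; returns an empty List on
    // failure so callers never have to deal with null.
    static List<String> readFile(String path) {
        try {
            return Files.readAllLines(Paths.get(path), StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        // A nonexistent path simply yields an empty list.
        System.out.println(readFile("/no/such/file.txt").size()); // prints 0
    }
}
```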

Processing files

With this very nice start, let's begin the file processing. The strategy is to read the general layout file to detect all tables and their columns. So, let's use a Map, where the key is the table's name and the value is the table's columns, as a List of Strings. This strategy provides an easy way to work with the layout information and to access the desired data from the data files.

First nightmare: how to perform this conversion using the Stream API? None of the default options could do the job, so let's Google it a bit. After Googling, I came to this StackOverflow question. Its accepted answer does the job, but hey, what alien code! At this point, I was not ready to understand such complex code. I had no problem seeing that it was the implementation of a custom Collector, but the Collector code itself was another story. Note that at this point, I introduced a lot of waste into the code.

Image 2 - First version of splitBySeparator method's code...

The strategy is to split the original file's list into sublists (Image 2). The sep argument is String::isEmpty, because I want to split the list whenever I find an empty line. The custom Collector receives three arguments: a Supplier, an Accumulator and a Combiner. The object created by the Supplier is what will be returned after a call to Stream's collect using the custom Collector. The Accumulator function (actually a BiConsumer) is called for each element of the stream and builds up the output. The Combiner function (a BinaryOperator) is only called when this Collector is used with a parallel stream, to combine multiple partial results from the Accumulator into a single result; for a sequential Stream, it is never called. The code above is not parallel-ready, but in this case it makes no difference, because the list must be processed sequentially or we won't get the expected result.
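A self-contained reconstruction of that splitBySeparator Collector may help; this is a sketch in the spirit of the StackOverflow answer, not the project's exact code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collector;

public class SplitCollector {
    // A custom Collector that splits a stream of elements into sublists,
    // cutting whenever the predicate matches (the matching element is
    // dropped). Sequential use only: the combiner below is not a correct
    // parallel merge.
    static <T> Collector<T, List<List<T>>, List<List<T>>> splitBySeparator(Predicate<T> sep) {
        return Collector.of(
                () -> {                          // Supplier: start with one empty sublist
                    List<List<T>> list = new ArrayList<>();
                    list.add(new ArrayList<>());
                    return list;
                },
                (acc, t) -> {                    // Accumulator: open a new sublist on separator
                    if (sep.test(t)) {
                        acc.add(new ArrayList<>());
                    } else {
                        acc.get(acc.size() - 1).add(t);
                    }
                },
                (acc1, acc2) -> {                // Combiner: only invoked by parallel streams
                    acc1.addAll(acc2);
                    return acc1;
                });
    }

    public static void main(String[] args) {
        List<List<String>> parts = Arrays.asList("a", "b", "", "c").stream()
                .collect(splitBySeparator(String::isEmpty));
        System.out.println(parts); // prints [[a, b], [c]]
    }
}
```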

Image 3 - ... and how to convert it into a Map

After splitting the list into a List<List<T>>, it is time to convert it into a Map<T, List<T>>, as desired. Let's take a look at Image 3's code. The list argument is the original file's List of lines and the sep argument is String::isEmpty, because I want to split the list at empty lines. After collecting with the splitBySeparator Collector above, a new Stream is started, filtering out empty lists (there are some at this point) and collecting to a Map using the default Collector implementation that generates a Map.

Note that this code is generic enough to work with any List and any Predicate. Just note that the first element of each sublist is used as the Map's key, so some adjustment may be needed for other inputs. Keep this comment in mind for the Java 7 implementation.
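As a hedged sketch of the Image 3 idea: the splitting itself is the custom Collector from Image 2, so a plain loop stands in for it here to keep the example self-contained; the interesting part is the filter + toMap pipeline at the end. Names are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class ListToMap8 {
    // A plain loop stands in for the Image 2 Collector; the Stream part
    // filters empty sublists and maps each block's first line to the rest.
    static <T> Map<T, List<T>> listToMap(List<T> list, Predicate<T> sep) {
        List<List<T>> split = new ArrayList<>();
        split.add(new ArrayList<>());
        for (T t : list) {
            if (sep.test(t)) split.add(new ArrayList<>());
            else split.get(split.size() - 1).add(t);
        }
        return split.stream()
                .filter(sub -> !sub.isEmpty())                // drop the empty sublists
                .collect(Collectors.toMap(
                        sub -> sub.get(0),                    // first line is the key
                        sub -> sub.subList(1, sub.size())));  // the rest are the values
    }

    public static void main(String[] args) {
        Map<String, List<String>> map = listToMap(
                Arrays.asList("TABLE_A", "col1", "col2", "", "TABLE_B", "col3"),
                String::isEmpty);
        System.out.println(map.get("TABLE_A")); // prints [col1, col2]
    }
}
```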

Validating

Although this is an example project, it mimics real-world applications, so it needs some kind of data validation. In this case I chose to simply check whether the general layout and each table's specific layout have the same content. The simplest way to do so is to convert each List of columns into a single String, taking care to use the same sorting for both files.

Image 4 - A simple validation step.

The Map.Entry in the code above is an entry from the general layout's Map. The key is the table's name and the value is the List of that table's columns (with extra column-processing information). This snippet just starts a Stream from each List of columns, sorts it and collects it into a String using the default joining Collector. It then throws an Exception if the two are not the same. Please note the call to the readFile method: it guarantees that the argument to equals comes from the other file.
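The core of the validation can be sketched as follows; method and exception names are assumptions, simplified to two plain Lists.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class Validate8 {
    // Sort both column lists the same way, join them into single Strings,
    // and compare; different content means invalid input.
    static void validate(List<String> generalLayout, List<String> tableLayout) {
        String expected = generalLayout.stream().sorted().collect(Collectors.joining(";"));
        String actual = tableLayout.stream().sorted().collect(Collectors.joining(";"));
        if (!expected.equals(actual)) {
            throw new IllegalStateException("Layouts do not match");
        }
    }

    public static void main(String[] args) {
        // Same columns in a different order still validate.
        validate(Arrays.asList("NO", "CO", "DT"), Arrays.asList("CO", "DT", "NO"));
        System.out.println("valid"); // prints valid
    }
}
```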

Generating SQL inserts

The next step is to generate the desired SQL inserts. This is the core of the processing in this project. At this point the data files are processed using the positional information contained in the layout files.

Image 5 - Processing the data files.

The Map.Entry in Image 5's code is an entry from the general layout's Map. The key is the table name and the value is the List of that table's columns (with extra column-processing information). The first thing to do is validate the entry using Image 4's validate method. Once it's validated, it's important to remove some header information from the file, to prevent misbehaviour of the application.

Some extra information about the layout now becomes important: each layout line can be split on a comma, here conveniently replaced by the SEPARATOR constant. After splitting a layout line, only three positions are interesting for this project: index 0, containing the column name; index 2, which gives the position where reading starts in the data file; and index 3, which gives the position where reading stops.

Next, it's time to get a list of each column's name. Starting a Stream from the layout List, I mapped each split layout line to its index 0, sorted the result and collected it using the default toList Collector.

After extracting the column information, it's time to iterate over the data file's content. Note that this is the first parallelStream used in the entire code so far. I used it here because this is the first point where parallelism does no harm, and benchmarks showed parallelStream is faster at this point. That is a major tip: test your own case to see if parallelStream is the best option. I used a forEach loop here to populate a List of Maps representing each line of the file. The Map is a key-value representation of the data, where the key is the column's name and the value is that column's content in that line of the file. Since each file line holds many columns, I used a nested forEach, this time over the layout's content, to convert each line into the Map I wanted.

At this point, the processedDataList variable is a List where each data file line has been converted into a map from columns to their data. Now I used another parallelStream over processedDataList to actually generate the inserts. After removing null elements, each line is mapped to its SQL insert text and then collected into a List, using the default toList Collector.

Image 6 - Generating the SQL insert.

To produce the SQL insert, I used the code shown in Image 6. There is a StringBuffer to build the final String and two Stream operations inside the method. The first one I already explained. The second one is the most interesting: starting a Stream from the list of columns, each column is mapped to its data value. Then Java 8's Optional handles null elements, returning SQL's NULL keyword as a String when there is no data for that column in the Map. It is also important to map each empty value to the NULL String. Finally, the default joining Collector creates the String.
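The Optional-based mapping can be sketched like this; a StringBuilder stands in for the article's StringBuffer, and table/column names, quoting and method names are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

public class InsertGen8 {
    // Columns drive both the insert's column list and its value list;
    // Optional turns missing data into SQL NULL, and a filter also turns
    // empty values into NULL.
    static String generateInsert(String table, List<String> columns, Map<String, String> line) {
        StringBuilder sb = new StringBuilder("INSERT INTO ").append(table).append(" (");
        sb.append(columns.stream().collect(Collectors.joining(", ")));
        sb.append(") VALUES (");
        sb.append(columns.stream()
                .map(col -> Optional.ofNullable(line.get(col))
                        .filter(v -> !v.isEmpty())   // empty values also become NULL
                        .map(v -> "'" + v + "'")
                        .orElse("NULL"))
                .collect(Collectors.joining(", ")));
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        Map<String, String> line = new HashMap<>();
        line.put("CO", "101");
        line.put("NO", "");
        System.out.println(generateInsert("TB_X", Arrays.asList("CO", "NO"), line));
        // prints INSERT INTO TB_X (CO, NO) VALUES ('101', NULL)
    }
}
```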

Orchestrating everything

Now that I've shown how to process the entire file, it's time to understand how to orchestrate all these method calls.

Image 7 - Orchestrating all methods.

Image 7's code shows how to orchestrate everything. This is the entry point for all the processing, and it is very simple. First I use readFile to read the general layout file. Then I use listToMap to transform the List of the file's lines into a Map where the key is the table's name and the value is the List of the table's columns.

The next step is to get the Map's entry set, from which I can get a Stream, to start processing. After getting the Stream from the entry set, it is important to remove null elements. Then each Map.Entry is converted to the process method's output using flatMap. Flat mapping is an extremely important Stream concept: it allows you to convert one input element into multiple output elements transparently. In this case, each entry of the Map, representing a table, is converted into a List of Strings representing all the SQL inserts to run on that table. Finally, it's time to produce the algorithm's output, a List of Strings, using the default toList Collector.
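A skeleton of that entry point could look like this; process here is a trivial stand-in for Image 5's real method (which returns a List in V1, hence the .stream() inside flatMap), and all names are assumptions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

public class Orchestrate8 {
    // Stand-in for Image 5's process method: one table entry yields many inserts.
    static List<String> process(Map.Entry<String, List<String>> entry) {
        return entry.getValue().stream()
                .map(line -> "INSERT for " + entry.getKey() + ": " + line)
                .collect(Collectors.toList());
    }

    // Skeleton of the entry point: flatMap turns each Map entry (one table)
    // into the many inserts generated for it.
    static List<String> execute(Map<String, List<String>> layout) {
        return layout.entrySet().stream()
                .filter(Objects::nonNull)
                .flatMap(entry -> process(entry).stream()) // one entry in, many inserts out
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> out = execute(Collections.singletonMap("TB_X", Arrays.asList("a", "b")));
        System.out.println(out.size()); // prints 2
    }
}
```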

Comments over this implementation

The code presented in this section is the result of my first contact with Java 8. It contains lots of problems I'll discuss in the section called Improvement Phase - Java 8. At this point, a good exercise is to understand this code and try to find performance problems, to compare with the results shown later. That way you may produce an even better optimized implementation than mine.

Development Phase - Java 7

After the first implementation using Java 8, it's time to talk about the first implementation using Java 7. The first version of the Java 7 code was basically a translation of the Java 8 implementation, to test performance on a consistent basis. Although it's not strictly necessary, it's recommended to read the next subsections alongside this source file for reference.

Reading files

In this version, I used the most common implementation for reading files: Java 6 style. Please note the amount of extra code compared to the Java 8 version.

Image 8 - Java 6's style to read files.
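For reference, the classic pre-NIO reading pattern likely looks something like this sketch; names are assumptions, and the ceremony is the point.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FileReader6 {
    // The classic pattern: open, loop, close in a finally block.
    // Compare the ceremony here with the single-line Java 8 version.
    static List<String> readFile(String path) {
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(new InputStreamReader(
                    new FileInputStream(path), "ISO-8859-1"));
            List<String> lines = new ArrayList<String>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        } catch (IOException e) {
            return Collections.emptyList();
        } finally {
            if (reader != null) {
                try { reader.close(); } catch (IOException ignored) { }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(readFile("/no/such/file.txt").size()); // prints 0
    }
}
```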

Processing files

The same strategy as the Java 8 version was used at this point: create a Map where the key is the table's name and the value is a List of the table's columns. There was no problem writing the Java 7 version, so no Googling was needed. Please note that this method is much more readable. Also, there is no need for a two-method solution; one alone does the job.

Image 9 - Converting List to Map. Much easier to understand!

Reading the code, the first loop does the same thing the custom Collector does in the Java 8 version, and the second loop generates the output Map, just like Java 8's default toMap Collector. Note that this implementation also has problems with empty Lists.

Note that this code is not completely generic; only Strings can be processed without modification. I chose not to use Java 8's Predicate functional interface here, to stay strictly within Java 7's default code style, exactly to show where Java 8 is better or worse in terms of readability and usability.
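The two-loop structure described above can be sketched as follows; this is a reconstruction with assumed names, not the code from Image 9.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ListToMap7 {
    // Two explicit loops: the first splits at empty lines (what the custom
    // Collector did in Java 8), the second builds the Map (what toMap did).
    static Map<String, List<String>> listToMap(List<String> list) {
        List<List<String>> split = new ArrayList<List<String>>();
        split.add(new ArrayList<String>());
        for (String line : list) {
            if (line.isEmpty()) {
                split.add(new ArrayList<String>());
            } else {
                split.get(split.size() - 1).add(line);
            }
        }
        Map<String, List<String>> map = new HashMap<String, List<String>>();
        for (List<String> sub : split) {
            if (!sub.isEmpty()) {
                map.put(sub.get(0), sub.subList(1, sub.size()));
            }
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, List<String>> map = listToMap(
                Arrays.asList("TABLE_A", "col1", "col2", "", "TABLE_B", "col3"));
        System.out.println(map.get("TABLE_B")); // prints [col3]
    }
}
```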

Validating

Following the Java 8 strategy again, let's validate the input's content by comparing the Strings generated from the column Lists of both the general layout and each table's layout file.

Image 10 - Validation by converting List into String.

Compared to Java 8's code in Image 4, there is plenty more code here. It's easy to understand, but the Stream implementation is also easy to read. Here I use two for loops to convert each List into a String and then check whether both are equal. An inconvenience is the need to remove the last SEPARATOR from the output String, done through substring calls. I chose to remove the last SEPARATOR character to keep the compared objects exactly the same as those generated by the Stream solution, for consistency.
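The loop-plus-substring idea can be sketched like this (a reconstruction with assumed names, which also assumes non-empty column lists):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class Validate7 {
    // Build both Strings by hand, then trim the trailing separator so the
    // result matches the joining Collector's output from the Java 8 version.
    static void validate(List<String> generalLayout, List<String> tableLayout) {
        List<String> g = new ArrayList<String>(generalLayout);
        List<String> t = new ArrayList<String>(tableLayout);
        Collections.sort(g);
        Collections.sort(t);
        StringBuilder expected = new StringBuilder();
        for (String column : g) {
            expected.append(column).append(";");
        }
        StringBuilder actual = new StringBuilder();
        for (String column : t) {
            actual.append(column).append(";");
        }
        // substring drops the trailing ";" appended by the last iteration
        if (!expected.substring(0, expected.length() - 1)
                .equals(actual.substring(0, actual.length() - 1))) {
            throw new IllegalStateException("Layouts do not match");
        }
    }

    public static void main(String[] args) {
        validate(Arrays.asList("NO", "CO"), Arrays.asList("CO", "NO"));
        System.out.println("valid"); // prints valid
    }
}
```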

Generating SQL inserts

As in the Java 8 version, here is the core of the Java 7 implementation. Read Image 11's code carefully.

Image 11 - Java 7's style for processing data files.

This is the most similar code snippet between the versions. Almost every line does the same thing as in the Java 8 version. The main difference is that I don't use parallelism here, just sequential processing.

Image 12 - Generating the SQL insert.

Generating the SQL insert with Java 7 takes a lot more code, but it is easy to read. The same approach is used: a StringBuilder to construct the output String, and two loops, one for the column list and one for the data list. The same trailing-SEPARATOR removal problem appears here. It is really annoying, but it does the job.

Orchestrating Everything

Again, let's talk about how to orchestrate everything shown so far in the Java 7 version.

Image 13 - Orchestrating all methods.

Again, very similar code between versions. In fact, the only differences are the specialized listToMap implementation (without a Predicate argument) and a loop instead of Stream's flatMap. This method is also the entry point for all processing. The Java 8 version uses less code to do the same work, but this code is easier to understand.

Comments over this implementation

This section's code is just a translation of the Java 8 version. In fact, any method could be swapped between the two implementations. The only real difference in the methods' interfaces is the use of Predicate as a separator function in Java 8's listToMap, something Java 7 can't provide and which can be very useful when reusing code. It is also clear that Java 8 can do the same work with a lot less code than Java 7.

At this point, there is a lot of waste code in both implementations. As a spoiler, this Java 7 implementation is faster than Java 8's. More about that in the Benchmark Results section, after discussing improvements to each version. Again, I recommend understanding all the code shown in this section and trying to remove waste code and improve performance.

Improvement Phase - Java 8

In this section, the main goal is to show how the code evolved after studying more and rethinking the problem. Here I'll talk only about the Java 8 implementation's improvements; the next section covers the Java 7 version. Again, the sequence of events is how I really developed the project.

I developed two major evolutions using Java 8, which I'll call V2 and V3 to simplify writing and reading. I'll refer to the first version as V1 when needed. The same subsections as in the sections above will be used to show how each part evolved. Skipped subsections have not changed since the previous version.

Java 8's V2

This version removes lots of waste code and improves readability. Although it's not strictly necessary, it's recommended to read the next subsections alongside this source file for reference.

Processing Files

In this subsection, I tried lots of different implementations for converting a List into a Map. I'll focus on the chosen implementation, but this file contains other alternatives. All alternatives will be discussed in the Benchmark Results section.

I replaced the two-step custom Collector approach in favor of a reduce-based implementation. This approach also came from this StackOverflow question. It provides less, simpler code that is easier to understand and performs better.

Image 14 - Second version of listToMap using Java 8.

Note that the reduce arguments are very similar to the custom Collector's arguments. Some improvements were made to simplify the code, but it is pretty much the same code as in Image 2. Note that this listToMap method will not behave properly if called on a parallelStream, because the input List must be processed sequentially.
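A reduce-based sketch in that spirit might look like this; as with the article's version, it is sequential-only, because the accumulator mutates its argument (which a parallel stream would corrupt). Names are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class ListToMapV2 {
    // Split via reduce, then filter + toMap as before. The identity,
    // accumulator and combiner mirror the custom Collector's three arguments.
    static <T> Map<T, List<T>> listToMap(List<T> list, Predicate<T> sep) {
        List<List<T>> split = list.stream().reduce(
                new ArrayList<List<T>>(),
                (acc, t) -> {
                    if (acc.isEmpty() || sep.test(t)) {
                        acc.add(new ArrayList<T>());   // open a new sublist
                    }
                    if (!sep.test(t)) {
                        acc.get(acc.size() - 1).add(t);
                    }
                    return acc;
                },
                (a, b) -> { a.addAll(b); return a; }); // combiner, unused sequentially
        return split.stream()
                .filter(sub -> !sub.isEmpty())
                .collect(Collectors.toMap(sub -> sub.get(0),
                        sub -> sub.subList(1, sub.size())));
    }

    public static void main(String[] args) {
        Map<String, List<String>> map = listToMap(
                Arrays.asList("TABLE_A", "col1", "col2", "", "TABLE_B", "col3"),
                String::isEmpty);
        System.out.println(map.get("TABLE_A")); // prints [col1, col2]
    }
}
```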

Unfortunately, this code is still quite complex to understand and may cause maintenance problems if used in a real-world project.

Validating

No big changes here; I just created two variables to improve readability. It's now much easier to understand than Image 4's code.

Image 15 - New validate method. Easier to understand.

Generating SQL inserts

Lots of improvements here. First, all the synchronized stuff has been removed to simplify the code and increase readability. Then, one less Stream processing step is needed, merging the last two parallelStreams into a single one. That means one less iteration over the results and more performance!

Image 16 - Clearer way to generate inserts. Compare it with the Image 5.

One important change is the use of a Supplier variable. It is very useful when a Stream needs to be reused. Java's Streams are not reusable: once a terminal operation has run, the Stream can't be used again. But we can still assign a Stream definition to a variable, in order to reuse code, through the Supplier functional interface. Look at the accesses to the columnSupplier variable: its single get method returns a new instance of the desired Stream each time it's called. It helps the readability of the code.
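The Supplier trick can be shown in isolation; this is a minimal, self-contained example, not the project's actual columnSupplier.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamSupplierDemo {
    // A Stream dies after its terminal operation, but a Supplier hands out a
    // fresh Stream on every get() call, so the pipeline definition is written
    // once and reused safely.
    static Supplier<Stream<String>> columnSupplier(List<String> columns) {
        return () -> columns.stream().sorted();
    }

    public static void main(String[] args) {
        Supplier<Stream<String>> supplier =
                columnSupplier(Arrays.asList("CO", "NO", "DT"));
        String joined = supplier.get().collect(Collectors.joining(", "));
        long count = supplier.get().count(); // a second terminal operation, no error
        System.out.println(joined + " / " + count); // prints CO, DT, NO / 3
    }
}
```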

Also note that this method now returns a Stream instead of a List. Java's Streams are only processed when a terminal operation is called, such as forEach, collect or reduce. The fewer terminal operations needed, the better, because that way Java can optimize the code's execution. More about how this can optimize execution time in the next section. Please go back and take a look at Image 7's code: flatMap needs the lambda's output to be a Stream. V1's collect approach was quite wrong, since the process method's output List was only used to be turned into a Stream again.

Image 17 - Easier way to generate the SQL insert.

Generating the insert text was also improved; compare it with Image 6's code. Now I used a different joining Collector overload to provide a prefix and a suffix for the output String. Note that the baseInsertText variable's value does not change for each processed data line: it is now generated once per table and reused for each of the data file's lines. Less code to process, more performance.
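The prefix/suffix overload of joining can be illustrated as follows; the table and column names are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class BaseInsertDemo {
    // joining's three-argument overload adds a prefix and a suffix, so the
    // fixed part of the insert is built once per table and reused for every
    // data line.
    static String baseInsertText(String table, List<String> columns) {
        return columns.stream()
                .collect(Collectors.joining(", ",
                        "INSERT INTO " + table + " (", ") VALUES ("));
    }

    public static void main(String[] args) {
        String base = baseInsertText("TB_X", Arrays.asList("CO", "NO"));
        System.out.println(base + "'101', 'abc')");
        // prints INSERT INTO TB_X (CO, NO) VALUES ('101', 'abc')
    }
}
```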

Orchestrating everything

The only change is that there is no need to start a Stream inside flatMap, because the process method already returns a Stream.

Image 18 - Now I can use Java 8's method reference instead of lambda inside flatMap.

Comments over this implementation

After this improvement round, the code became more concise and easier to understand. Lots of wasteful operations were removed. Looking at the code, it doesn't seem to be such a huge improvement, but believe me, it is. This version made me start thinking about how to develop with the Stream API in mind, and helped me understand how this new world of Java programming works.

Java 8's V3

This version chains as many operations as possible onto the same Stream, in order to reduce the number of terminal operations. It's important to maximize the number of operations on the same Stream because it reduces the amount of looping performed over the data. As Streams are lazily executed, Java executes all the operations over the data inside the same loop when a terminal operation is called, performing all the needed transformations on each iteration. Note that hand-writing code with the same behaviour could easily become a mess, while with Streams it is not that hard to understand what's going on. Although it's not strictly necessary, it's recommended to read the next subsections alongside this source file for reference.

Reading files

This time, another default Java 8 approach was used: reading a file directly into a Stream. The file is only actually read when a terminal operation is called, so all the file's data is processed according to the Stream's operations before it is turned into an object.

Image 19 - Reading a file to a Stream.

Note that the code is as easy to write as the code for reading all of a file's lines. It is important to return an empty Stream instead of null, because it prevents NullPointerExceptions. Why this is better for my scenario will become clearer in the next subsections.
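A sketch of that lazy-reading method, with assumed names:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class FileReaderV3 {
    // Files.lines is lazy: the file's content is only read when a terminal
    // operation runs. Returning Stream.empty() on failure avoids
    // NullPointerExceptions. Note that the returned Stream keeps the file
    // open until it is closed, so leaking these streams exhausts file handles.
    static Stream<String> readFile(String path) {
        try {
            return Files.lines(Paths.get(path), StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            return Stream.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(readFile("/no/such/file.txt").count()); // prints 0
    }
}
```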

Processing Files

In order to use the longest Stream possible, the approach has changed here. I no longer want a Map; instead, a List of Lists is more appropriate. This change eliminates one collect operation, which was used to generate the Map.

Image 20 - New way to split data returning a Stream.

The differences between this version and V2's code in Image 14 are: the list argument is a Stream instead of a List; the method has a new name, as it no longer converts the List into a Map; and the return type has changed from Map to Stream. Note that the first terminal operation is used here, in the call to reduce. I haven't found a way to eliminate this operation. At this point, the input file has been read and processed, now turned into a List of Lists, and a new Stream is started.

Validating

Following these changes, the validation step was also modified. This time the validation approach is completely different, in order to reduce data manipulation a bit.

Image 21 - New strategy to validate data.

As there is no Map anymore, the method's arguments are now the tableName String and the layoutList List, replacing the Map.Entry argument. Note that the same information is passed to the method. The new approach is to compare two Lists instead of two Strings. Note that it is necessary to compare the lists through containsAll instead of equals, to avoid sorting each List. But using containsAll requires comparing the Lists both ways, or the result may not be correct: comparing just one List against the other will not detect when the first List has more elements than the second.
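The two-way containsAll check is worth seeing in isolation; a minimal sketch (the method name is an assumption):

```java
import java.util.Arrays;
import java.util.List;

public class ValidateV3 {
    // Comparing with containsAll avoids sorting, but it must run both ways:
    // each check alone only proves one list is a subset of the other.
    static boolean sameColumns(List<String> general, List<String> table) {
        return general.containsAll(table) && table.containsAll(general);
    }

    public static void main(String[] args) {
        System.out.println(sameColumns(
                Arrays.asList("CO", "NO", "DT"),
                Arrays.asList("DT", "CO", "NO"))); // prints true
        // A one-way check would miss this case: general has an extra column.
        System.out.println(sameColumns(
                Arrays.asList("CO", "NO", "DT"),
                Arrays.asList("CO", "NO")));       // prints false
    }
}
```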

Generating SQL inserts

There are some changes in the core processing as well. The argument of the process method has changed because there is no Map.Entry type anymore; it now receives a List instead of a Map.Entry.

Image 22 - Generating inserts using as many Streams as possible.

The main change is that the returned Stream is now actually the file's Stream with some mapping operations. This way, the data file is not actually read inside this method, because a Stream is returned. The generateInsert method was not modified.

Orchestrating everything

As all the other methods have changed, the execute method naturally has modifications too. It actually became simpler and more readable.

Image 23 - New way of orchestrating method calls.

Note that the splitList, readFile and process methods now return a Stream. This simplifies the implementation a bit. It is important to note that splitList actually starts a new Stream, because readFile's Stream is reduced to a List of Lists inside splitList. Also note that the process method's output Stream reads the data file lazily, which means each data file is only read when the final collect method is called.

Comments over this implementation

This implementation was made to really use the Stream API's full potential. It is extremely important to note that when I wrote the first Java 8 version of this code, I was not capable of developing something like this version, because I was not yet thinking in Java 8, only the old Java 7 way. I haven't found any other optimization to apply to this code, but certainly something could still be optimized. Anyway, I'm very happy with the performance results of this implementation.

Improvement Phase - Java 7

The Java 7 version of this project has also been improved, again based on Java 8's V2 implementation. But the second version of the Java 7 code is not just a translation of Java 8's second version; some other improvements were applied too. There is only one improved version based on Java 7, called V2. The first version will be referred to as V1.

Java 7's V2

This version brings all the shareable optimizations used in Java 8's V2, plus some exclusive changes. Although it's not strictly necessary, it's recommended to read the next subsections alongside this source file for reference.

Reading files

This time, I decided to use Java 7's default implementation for reading files, using try-with-resources. It shortened the code a lot, also making it more readable.

Image 24 - Java 7's style for reading files.

Compared to Image 8's code, there's a lot of improvement in readability here. Note that multiple resources in the try statement are not strictly necessary, but using only one implies constructing objects by passing a constructor as an argument, which reduces readability. The biggest advantage of try-with-resources is that resources initialized inside the parentheses do not need to be closed explicitly.
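The try-with-resources version might look like this sketch (names are assumptions); each resource declared in the parentheses is closed automatically, even when an exception is thrown.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FileReader7 {
    // Declaring each resource separately keeps the chain readable; all three
    // are closed automatically, in reverse order, when the try block exits.
    static List<String> readFile(String path) {
        try (FileInputStream fis = new FileInputStream(path);
             InputStreamReader isr = new InputStreamReader(fis, "ISO-8859-1");
             BufferedReader reader = new BufferedReader(isr)) {
            List<String> lines = new ArrayList<String>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        } catch (IOException e) {
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        System.out.println(readFile("/no/such/file.txt").size()); // prints 0
    }
}
```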

Processing files

This subsection contains a lot of improvement, both in performance and readability. A new strategy was used to convert a List into a Map; the new approach needs only one loop over the input List to generate the Map. There is an extra attempt at improving this part that won't be discussed in this topic, only in the Benchmark Results section. The code can be found here.

Image 25 - New algorithm to convert a List into a Map.

Note how it is possible to construct the desired Map efficiently using a simple flag. This solution also has no problems with empty Lists. This code is much more readable than Java 7's V1 implementation, and much, much more readable than the Java 8 versions.
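The flag idea can be sketched as follows; a reconstruction with assumed names, not the code from Image 25.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ListToMap7V2 {
    // Single pass: a flag marks "the next non-empty line is a table name",
    // so no intermediate List of Lists is needed.
    static Map<String, List<String>> listToMap(List<String> list) {
        Map<String, List<String>> map = new HashMap<String, List<String>>();
        List<String> current = null;
        boolean newTable = true;
        for (String line : list) {
            if (line.isEmpty()) {
                newTable = true;        // next non-empty line starts a new table
            } else if (newTable) {
                current = new ArrayList<String>();
                map.put(line, current); // first line of a block is the key
                newTable = false;
            } else {
                current.add(line);
            }
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, List<String>> map = listToMap(
                Arrays.asList("TABLE_A", "col1", "col2", "", "TABLE_B", "col3"));
        System.out.println(map.get("TABLE_A")); // prints [col1, col2]
    }
}
```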

Validating

No significant performance changes were made to the validate method, but readability was greatly improved by using a helper method that converts a List into a String using a default separator.

Image 26 - More readable validation method.

Compared to Image 10's code, the main change is just the use of a helper method for joining a List into a String. It's very clear how much cleaner the code became, and now there is some reusability.

Image 27 - New way to convert a List into a String allows code reusing.

The helper method itself is unchanged from the Image 10 code it replaces.

Generating SQL inserts

There are some improvements here already known from Java 8's V2, such as eliminating one loop over the data and processing the SQL insert's base text only once.

Image 28 - New way to processing data files.

Some waste code was removed, but the general structure of this method did not change. Compare it with Image 11's code; there are not many changes here.

Image 29 - Optimized way to generate SQL inserts.

Note that the generateInsert method has evolved a lot. Reusing the baseInsertText variable's content simplified this method very much. Compare it with Image 12's code: both readability and performance improved here.

Comments over this implementation

The major improvement in this version was the listToMap method. The other changes were just about removing unnecessary code, while listToMap is really new code. But even with few changes, all that waste code was significantly limiting how performant this implementation could be. I haven't found any new improvements to apply to this code at the moment, though some improvement may certainly still be possible. Anyway, I'm very happy with the performance results for this implementation.

Benchmark Results

Now that all the code has been properly explained, let's see how each version performs in a consistent test. As already mentioned, I used the JMH library to benchmark all the code. It makes test results consistent and provides serious metrics. All the benchmark code is available in this file and will not be explained here.

Image 30 - Benchmark results!

Image 30 contains the results of all benchmarks implemented in this project. They will all be properly explained in detail in the next subsections, properly split up for better readability. About the table's columns: the Benchmark column displays the benchmark's name; the Mode column displays which kind of test the result refers to (avgt means Average Time); the Cnt column displays how many runs were performed to compute the score; the Score column is the test's result and the most important column to compare; the Error column is the test's error margin; the Units column is the measurement unit of the Score and Error columns.

Please note that all benchmarks here used the Average Time mode. Also note that the benchmark name indicates which Java version is used and which version explained here the test refers to. Finally, note that benchmark results are machine dependent, so running them on a different machine will produce different results.

Read file results

Image 31 - Benchmark results from reading files.

Let's start with reading files. Both Java 7 implementations have exactly the same performance. Java 8's first and second implementations show practically the same results, a little slower than the Java 7 versions, and Java 8's V3 implementation is far faster than anything else! Why? Because readFile's implementation in Java 8's V3 just opens the file into a Stream, which is simply closed here, with no processing performed. In fact, this test only shows how long it takes to prepare a Stream, not to read the file itself.

Analyzing the results, the verdict is to use the Java 8 versions (V3 if a Stream is required, V1 if a List is required), unless performance is a very critical issue for your application and file reading operations are very common. Otherwise, Java 8's ease of writing and reading and the small difference between results are really convincing arguments.

Conversions from List to Map

Image 32 - Benchmark results from converting List into Map.

Please note that some benchmarks here do not refer to one specific version. Those tests are partially described back in the Improvement sections as partial improvement code. Note also that Java 8's V3 splitList method is not tested here, because it throws lots of "too many open files" exceptions while running the benchmark tests, even when collecting the Stream's result.

Looking at Java 7's results, the partial-study method using the edges approach, not shown in this article, looks faster, but its more complex code and its greater error margin were decisive in choosing the V2 implementation. Because of the error margin, results were different in past runs, with the V2 implementation faster.

Now looking at Java 8's results, the fastest implementation is definitely the one using the forEach approach. This version is basically Java 7's V2, replacing the traditional for loop with Java 8's forEach call and a lambda argument. This code works, but requires some inconvenient global variables, because a lambda can only read local variables that are effectively final. That is why I discarded this solution. As for the edges approach, the same argument used to reject Java 7's edges approach applies: the V2 implementation is as fast as the edges approach, but shows more stable results.

In this case, Java 7's code is faster than Java 8's implementation. Also, the second Java 7 version's code is easier to understand, so it's better to use the Java 7 style at this point. Although the Java 8 version has very close performance, it lacks readability.

Full processing

Image 33 - Benchmark results from all processing presented.

Surprise! Java 8's V3 optimized implementation is faster than Java 7's V2 optimized implementation?! Yes! But notice that Java 8's V3 has a much larger error margin than Java 7's V2. It means that in the long term Java 7's V2 may be faster, and it's even likely that a single Java 7 run would beat a single Java 8 run. I prefer to call the results a tie. But hey, that's awesome! It means that Java 8's Streams can be as fast as Java 7's traditional approach, and this is a great result!

Looking at the other results, it's impressive how much performance can be gained just by removing waste code. It also shows that Java 8 Streams need to be written correctly to achieve satisfactory performance.

Conclusion

This project shows a bit about Java 8 Stream optimization. It's clear that it is very easy to make a mistake when working with Streams, and it may slow down your application's performance in a concerning way. It just means that Stream processing must be analyzed very carefully before releasing the code.

It is important to note that a real-world application would use a hybrid approach, mixing Java 7 concepts with Java 8's new Stream processing, rather than the all-or-nothing solution shown here. This article is proof that Streams can be used alone, but Java's traditional style is still easier to use and usually has better performance.

Please use the comments box or GitHub issues/pull requests to discuss the results and suggest improvements. That way, we all learn together how to improve development with Java 8's new features.

Thank you for your time!