Increase throughput with async and non-blocking calls

If you really want to hear about it, the first thing you'll probably want to know on a workday morning is how your previous day's work performed. Well, I was lucky this morning, hearing big applause for winning yet another hackathon at a client location. In a hackathon you build something useful in 8-24 hours; Facebook's Like button was built in one of its hackathons.

I chose a topic to showcase "how the right choice of technology and programming style can dramatically increase application throughput". The first question that came to my mind was how I would demonstrate that something I built is better; I realized I needed something to compare against. I should build something the traditional way, measure its throughput, then build exactly the same thing the way I think is better and compare the results. That's it, my instinct told me this was the way to go 🙂

I built an application to mock Facebook's OAuth 2.0 response. I kept it very simple: one validation to check whether an access token exists in the request, then flush a dummy response. The source code is here on GitHub.

I then built the same functionality using Play and Scala; get the source code here. I chose Play because it uses Netty, a non-blocking server, at the backend, which I thought would boost throughput further along with the style of programming (async and non-blocking) I wanted to showcase. Play also allows hot-swapping (changing and testing code without a restart). And Scala because it is easy to write concise, asynchronous, non-blocking code with it, and I thought it would also shed some light on a practical explanation of my previous blog, multicore crisis.
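To make the rest of the post easier to follow, here is a minimal sketch of what such a Play controller could look like. The controller and action names match the ones mentioned later (FacebookMock.facebookSync), but the body and the response fields are just my illustrative placeholders, not the exact repo code:

    import play.api.mvc._
    import play.api.libs.json.Json

    object FacebookMock extends Controller {

      // One validation: does the request carry an access token?
      // If yes, flush a dummy response; otherwise reject the call.
      def facebookSync = Action { request =>
        request.getQueryString("access_token") match {
          case Some(token) if token.nonEmpty =>
            Ok(Json.obj("id" -> "100001", "name" -> "Mock User"))
          case _ =>
            Unauthorized(Json.obj("error" -> "invalid access token"))
        }
      }
    }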

I then wrote a JMeter test script to put load on both applications. From this script I first ran the "facebook_imperative" test, which puts load on the traditional web application deployed on Tomcat with 300 threads configured in server.xml. I used 400 JMeter threads to fire concurrent requests for one second. The following is the result of the test. As expected, application throughput is around 544 requests per second with a 3+% error rate. The errors are because Tomcat could not handle that many requests and some of them timed out. The errors are obviously something I want to avoid because they compromise application reliability.

[Figure: traditional_sync_test]

I wrote exactly the same functionality in Play and Scala and ran the similar test, "facebookmock_sync", from the above script. The execution context for this also used exactly 300 threads (check FacebookMock.facebookSync and config.ConfigBlocking.executionContext), and JMeter again used 400 threads to fire requests.
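For the comparison to be fair, the Play handler runs on a dedicated pool of the same size as Tomcat's. A sketch of what that configuration could look like (the real definition is config.ConfigBlocking in the repo; this body is my assumption):

    package config

    import java.util.concurrent.Executors
    import scala.concurrent.ExecutionContext

    object ConfigBlocking {
      // A fixed pool of 300 threads, matching Tomcat's maxThreads,
      // so both applications are compared on an equal footing.
      implicit val executionContext: ExecutionContext =
        ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(300))
    }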

[Figure: play_scala_sync_test]

Now compare the results. There are no errors in the Play and Scala app and the throughput is 6000+ requests per second, roughly 12 times more. If I may take the liberty of attaching some meaning to this, it means I can serve about 12 times more customers with the same infrastructure and with a much quicker response time (look at the Average column). It also means I need a much smaller tech-ops team, and I will have happier customers with less expense and investment.

I decided to take this further and thought of adding a more realistic simulation. Not all our calls are non-blocking like this mock response; some of them may be blocking calls, e.g. JDBC calls. I simulated this by putting a delay of one second using Thread.sleep before responding. That's the way we usually write code: just block the thread and wait for the result :(
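The blocking flavour, in both apps, boils down to something like this sketch (the body here is a hypothetical stand-in; the real versions live in the two repos):

    // Inside the same FacebookMock controller as sketched earlier.
    // Blocking style: the request-handling thread sleeps for a whole second,
    // doing nothing useful, before the dummy response is written.
    def facebookBlocking = Action {
      Thread.sleep(1000) // simulate a slow blocking call, e.g. JDBC
      Ok(Json.obj("id" -> "100001", "name" -> "Mock User"))
    }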

I ran the blocking test "facebookmock_blocking_Imperative" against the traditional application. For this I had the same application deployed on Tomcat, but now the "doGet" method used the blocking flavour of the mock response. The similar test, "facebookmock_async", ran against the Play app.

[Figure: traditional_blocking_test]

[Figure: play_blocking_test]

As you can see in the figures above, the error rate (18%) in the traditional app is unacceptable, so the throughput shown in the figure 'traditional_blocking_test' is misleading, whereas there are still no errors in the Play app. The difference in the code is that I released the server thread as early as I could using Futures and Promises. In fact I tried the blocking test even with 1000 JMeter threads: the traditional app gives a 50% error rate whereas the Play app still survives with 0% errors.
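Concretely, in the Play app the handler returns a Future and hands the blocking work to the dedicated pool; a sketch under the same assumptions as above (the real code is in the repo):

    import scala.concurrent.Future
    import config.ConfigBlocking.executionContext

    // Inside the same FacebookMock controller as sketched earlier.
    // The request-handling thread returns immediately with a Future;
    // the Thread.sleep happens on the dedicated 300-thread pool instead.
    def facebookAsync = Action.async {
      Future {
        Thread.sleep(1000) // the simulated blocking call
        Ok(Json.obj("id" -> "100001", "name" -> "Mock User"))
      }
    }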

The best part comes here. I converted the blocking call into a non-blocking one using a scheduler, and this time I used only one thread in the pool (check Scheduler.scala and Config.scala). And wow! The Play app still works with just one thread, with even slightly better results. That's really awesome!
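My assumption about the shape of that scheduler trick: instead of parking a thread in Thread.sleep, register a callback with Akka's scheduler and complete a Promise when the delay expires, so no thread sits idle waiting:

    import scala.concurrent.{Future, Promise}
    import scala.concurrent.duration._
    import akka.actor.ActorSystem

    object Scheduler {
      // Completes the returned Future after `delay` without blocking any thread.
      // The only work the (single-threaded) pool does is fulfilling the promise.
      def after[T](delay: FiniteDuration)(value: => T)(implicit system: ActorSystem): Future[T] = {
        val promise = Promise[T]()
        system.scheduler.scheduleOnce(delay)(promise.success(value))(system.dispatcher)
        promise.future
      }
    }

A controller action can then return something like Scheduler.after(1.second)(Ok(...)) from Action.async, and the one-second wait costs no thread at all.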

[Figure: play_non-blocking_test]

I have always believed that programming is an art, and that one should handle it with extreme delicacy and respect to craft a marvelous piece that you want the world to admire. This hackathon gave me one more opportunity to follow and showcase that.

Multicore crisis and Functional Programming

On one fine morning last November I was reading the article "The Free Lunch Is Over" at the coffee table. While trying to understand it, I found myself regretting that I had never taken functional programming seriously until then. I realized that all these days I had been writing sequential programs and had never tried to parallelize them. I was so focused on memory utilization that I had almost always taken CPU performance for granted and left it to chip designers to make my programs run faster.

As shown in the following graph, which is taken from Herb Sutter's article above, clock speed (dark blue) is no longer increasing exponentially as it did until around 2005. Moore's law may still hold for the exponential growth of transistor counts, but clock speeds have flattened out, which means the performance of a single core has hit a wall. Instead, chip designers are adding more cores, on-chip cache, hyper-threading and read/write optimizations to support the growing demand for processing power.

[Figure: Moore's law graph from "The Free Lunch Is Over"]

However, the way most of us write programs keeps these extra cores either idle or consumed running spyware and malware 🙂 Have a look at the CPU utilization of my quad-core processor below while my JVM is up. We can see that only one core is being used, and even that one to barely 50% of its capacity 😦, which across all 8 logical cores adds up to less than 10% overall.

[Figure: 8 unused cores]

In fact, adding more cores can even make our single-threaded, imperative-style programs slower, because per-core clock speeds are often dialled down for heat and power reasons. It's going to be hard to explain to the client that despite adding more cores to the machine, our programs are not running faster and are utilizing only 20% of the capacity.

Well, nevertheless! That's how I got pulled towards functional programming and spent sleepless nights over the last three months understanding its core concepts, tying the knots together, and presenting the topic to the client team to make sure I understood it well enough to convince people. And it paid off; it paid off well.

I used various dimensions explained by experts to understand where each programming language stands. You can add other dimensions to the table below, like honesty about side effects, commercial value, popularity, etc., and bring C#, F#, C++, Haskell and Erlang into the game.

          Java      Scala     JRuby     Clojure
Typing    Static    Static    Dynamic   Dynamic
Paradigm  OO        OO/FP     OO        FP

I chose Scala as the language for understanding core functional programming concepts because Scala is like a radio dial with OO at one end and FP at the other. You can tune it to the level you are comfortable with and keep adjusting the dial as you learn more; previous work experience in Groovy certainly helped me understand the concepts quickly.

The first and foremost benefit I observed in using a functional language was its honesty about side effects. Beyond that, I admire the following features:

  1. Concise code. Imagine how much code we would write in C# or Java to achieve the same:
    1. val someNumbers = List(1, 2, 3, 4, 5, 6, 7, 10, 34, 46, 75, 100)
    2. val onlyEven = someNumbers filter (_ % 2 == 0)
    3. val onlyOdd = someNumbers filter (_ % 2 != 0)
    4. val onlyMoreThan25 = someNumbers filter (_ > 25)
  2. Functions are first-class citizens; there are also higher-order functions, closures, partial functions and currying
    1. def f(x: Int) = x * 2
    2. def g(y: Int) = y + 2
    3. You can compose functions like f(g(2)), which gives the result 8
  3. Type Inference
    1. Map<Integer, String> employee = new HashMap<Integer, String>(); This is Java. Didn't we already state the type in the first part of the line?
    2. val capital = Map("US" -> "Washington", "France" -> "Paris") Scala is statically typed but infers the type wherever it can, which means you 'type' less 🙂
  4. Lazy Evaluation
  5. Control Abstraction
  6. Pattern Matching and Extractors (a small example follows after this list)
  7. XML
  8. Traits
  9. AKKA & Concurrency
  10. Modular Programming
  11. Tail-call Optimization
  12. Parallel Collections

Each one of these actually deserves its own blog post, hence I could not provide examples for all of them.
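Still, as one small taste of item 6 (pattern matching and extractors), here is a hedged sketch; the example and every name in it are mine, invented purely for illustration:

    // A custom extractor: defining unapply lets Email(user, domain) be used in patterns.
    object Email {
      def unapply(s: String): Option[(String, String)] = s.split("@") match {
        case Array(user, domain) => Some((user, domain))
        case _                   => None
      }
    }

    def describe(input: String): String = input match {
      case Email(user, "example.com") => "internal user " + user
      case Email(_, domain)           => "external mail from " + domain
      case other                      => "not an email address: " + other
    }

    // describe("alice@example.com") returns "internal user alice"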

Just to prove my understanding, I took one of my old pieces of Java code in which I was processing an incoming compressed file that in turn contains multiple files. I was processing these files inside a "for loop" like the one below, which took 7.5 seconds to complete, and only one of the 8 cores was being used because of the sequential style of processing.

for (File currentFile : uncompressedFiles)

    process(currentFile);

Then I rewrote the same processing using the AKKA actor system, creating 8 actors to process one file each. The graph below shows that all 8 cores got utilized and the processing completed within 2.5 seconds. I was quite astounded by the results and I will certainly showcase this to the client.

[Figure: 8 used cores]

val m = system.actorOf(Props[WordCountMaster], name = "master")

m ! StartCounting("src/main/resources/", 8)
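For the curious, the master and its workers could be sketched roughly like this. This is my own simplified reconstruction under the names used above (WordCountMaster, StartCounting); the Worker actor and ProcessFile message are invented here for illustration:

    import java.io.File
    import akka.actor.{Actor, ActorSystem, Props}

    // `system` in the two lines above would be created once, e.g.:
    // val system = ActorSystem("fileProcessing")

    case class StartCounting(path: String, numberOfWorkers: Int)
    case class ProcessFile(file: File)

    // Each worker handles one file at a time, so several files are processed in parallel.
    class Worker extends Actor {
      def receive = {
        case ProcessFile(file) => process(file)
      }
      def process(file: File): Unit = {
        // the same per-file work that the sequential for loop was doing
      }
    }

    // The master creates the requested number of workers and deals the files out to them,
    // which is what spreads the load across all the cores.
    class WordCountMaster extends Actor {
      def receive = {
        case StartCounting(path, n) =>
          val workers = Vector.fill(n)(context.actorOf(Props[Worker]))
          val files = new File(path).listFiles().toList
          files.zipWithIndex.foreach { case (file, i) => workers(i % n) ! ProcessFile(file) }
      }
    }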

Right now, I plan to set the book aside, leaving it on my coffee table to see where in daily work I can use it to ease development effort, write less code, and improve performance. If it works out, I may return and study it further, picking up from where I am now. I'm very happy to have these insights, which have certainly made me a better programmer. What do you think of this topic? Leave a comment and I will see what I can shed more light on.