Scala vs. Go TCP Benchmark

We recently found ourselves in need of some load balancing with a few special features that weren’t available off the shelf quite the way we wanted.
We thus set out on a little investigation into what it would take to write our own software load balancer. Since most of our code base and expertise is in Scala, building this on top of the JVM would be a natural choice.
On the other hand, a lot of people, including ourselves at Fortytwo, make the often – but not always – unfounded assumption that the JVM is slower than natively compiled languages.
Since a load balancer is usually an extremely performance-critical component, perhaps a different programming language/environment would be better?
We weren’t quite willing to go all the way down the rabbit hole and start writing C/C++, so we started looking for a middle ground that would give us the purported performance advantage of native code while still having higher-level features like garbage collection and built-in concurrency primitives. One such language that came up almost immediately was Google’s relatively new Go language. Natively compiled, with super nice built-in concurrency constructs. Perfect?

Go vs Scala

We decided to benchmark the TCP network stack processing overhead of Go vs. Scala in a very similar fashion to the recent WebSockets vs. TCP post.
We wrote a simple “ping-pong” client and server in both Go

//SERVER
package main
 
import (
    "net"
    "runtime"
)
 
func handleClient(conn net.Conn) {
    defer conn.Close()
 
    var buf [4]byte
    for {
        n, err := conn.Read(buf[0:])
        if err!=nil {return}
        if n>0 {
            _, err = conn.Write([]byte("Pong"))
            if err!=nil {return}
        }
    }
}
 
func main() {
    runtime.GOMAXPROCS(4)
 
    tcpAddr, _ := net.ResolveTCPAddr("tcp4", ":1201")
    listener, _ := net.ListenTCP("tcp", tcpAddr)
 
    for {
        conn, _ := listener.Accept()
        go handleClient(conn)
    }
}
//CLIENT
package main
 
import (
    "net"
    "fmt"
    "time"
    "runtime"
)
 
func ping(times int, lockChan chan bool) {
    tcpAddr, _ := net.ResolveTCPAddr("tcp4", "localhost:1201")
    conn, _ := net.DialTCP("tcp", nil, tcpAddr)
 
    for i:=0; i<int(times); i++ {
        _, _ = conn.Write([]byte("Ping"))
        var buff [4]byte
        _, _ = conn.Read(buff[0:])
    }
    lockChan<-true
    conn.Close()    
}
 
func main() {
    runtime.GOMAXPROCS(4)
 
    var totalPings int = 1000000
    var concurrentConnections int = 100
    var pingsPerConnection int = totalPings/concurrentConnections
    var actualTotalPings int = pingsPerConnection*concurrentConnections
 
    lockChan := make(chan bool, concurrentConnections)
 
    start := time.Now()
    for i:=0; i<concurrentConnections; i++{
        go ping(pingsPerConnection, lockChan)
    }
    for i:=0; i<int(concurrentConnections); i++{
        <-lockChan 
    }
    elapsed := 1000000*time.Since(start).Seconds()
    fmt.Println(elapsed/float64(actualTotalPings))
}

and Scala

//SERVER
import java.net._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
 
object main{
 
    def handleClient(s: Socket) : Unit = {
      val in = s.getInputStream
      val out = s.getOutputStream
      while(s.isConnected){
        val buffer = Array[Byte](4)
        in.read(buffer)
        out.write("Pong".getBytes)
      }
    }
 
    def main(args: Array[String]){
      val server = new ServerSocket(1201)
      while(true){
        val s: Socket = server.accept()
        future { handleClient(s) }
      }
    }
}
//CLIENT
import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import java.net._
 
object main{
 
    def ping(timesToPing: Int) : Unit = {
        val socket = new Socket("localhost", 1201)
        val out = socket.getOutputStream
        val in = socket.getInputStream
        for (i <- 0 until timesToPing) {
            out.write("Ping".getBytes)
            val buffer = Array[Byte](4)
            in.read(buffer)
        }
        socket.close
    }
 
    def main(args: Array[String]){
        var totalPings = 1000000
        var concurrentConnections = 100
        var pingsPerConnection : Int = totalPings/concurrentConnections
        var actualTotalPings : Int = pingsPerConnection*concurrentConnections
 
        val t0 = (System.currentTimeMillis()).toDouble
        var futures = (0 until concurrentConnections).map{_ => 
            future(ping(pingsPerConnection))
        }
 
        Await.result(Future.sequence(futures), 1 minutes)
        val t1 = (System.currentTimeMillis()).toDouble
        println(1000*(t1-t0)/actualTotalPings)
    }
}

The latter is almost exactly the same as the one used in the WebSockets vs. TCP benchmark. Both implementations are fairly naive and there is probably room for improvement. The actual test code contained some functionality to deal with connection errors, omitted here for brevity.
The client makes a certain number of persistent concurrent connections to the server and sends a certain number of pings (just the string “Ping”) to each of which the server will respond with the string “Pong”.

The experiments were performed on a 2.7 GHz quad-core MacBook Pro with both client and server running locally, so as to better measure pure processing overhead. The client would make 100 concurrent connections and send a total of 1 million pings to the server, evenly distributed over the connections. We measured the average round trip time.

To our surprise, Scala did quite a bit better than Go, with an average round trip time of ~1.6 microseconds (0.0016 milliseconds) vs. Go’s ~11 microseconds (0.011 milliseconds). The numbers for Go are of course still extremely fast, but if almost all your software does is take in a TCP packet and pass it on to another endpoint, this can mean a big difference in maximum throughput.

Notable in the opposite direction was that the Go server had a memory footprint of only about 10 MB vs. Scala’s nearly 200 MB.

Go is still new and will likely see performance improvements as it matures, and its simple concurrency primitives might make the loss in performance worth it.
Nonetheless, it’s a somewhat surprising result, at least to us. We would love to hear some thoughts in the comments!

37 comments
Oliver

Interesting idea for a benchmark. Here is some tweet by Rob Pike from Feb, 24th 2014: "Just looked at a Google-internal Go server with 139K goroutines serving over 68K active network connections. Concurrency wins.". So what happens if you increase the number of connections from 100 to 68k? I'd be curious ;-).

moru0011

move "Ping".getBytes out of the loop, it's pretty expensive

raghupathys

Would really love to see the same tests re-run with nagle turned off. Considering the really small packets being sent - I'm curious what impact having nagle on/off will have on throughput and latency.

kuba

try to send/receive 1 Byte buffers.

I modified server and client. Both have: val buffer = Array[Byte](1)

 

and client sends: out.write("A".getBytes)

server responses: out.write("B".getBytes)

 

start server, run client ...

it takes 10 times longer for me!

 

sanity1

Looking at Scala's memory usage can be a little misleading, since the JVM will generally use memory liberally within the limits you set for it, because it doesn't make sense to spend time garbage collecting before you need to.

hello

One of the worst Go codes I've ever seen.

stasiu88

Why the buffer is one byte long and initialized with a 4?

val buffer = Array[Byte](4)

enry_straker

Why in the world would you try to compare the performance of two languages with an I/O heavy benchmark vs a Cpu heavy benchmark? That's like comparing a ferrari and a porsche with a 1-mile swim?

boyd_stephen_smith_jr

in.read(buffer) in the Scala code is non-blocking, so it doesn't actually wait on the "Ping" message to be sent, it immediately writes "Pong" without waiting on the client.

 

On the "Go" side, conn.read(buf[0:]) is also non-blocking but you have an explicit loop that waits for at least one byte from the client.

 

I don't think you are actually testing the same thing with the two separate programs.

bradfitz

Nearly every line of the Go code could be improved or fixed.

 

* Not gofmt'd.

 

* This isn't guaranteed to read any certain amount, but you assume it does: n, err := conn.Read(buf[0:]). You should be using bufio or io.ReadFull.

 

* You allocate a lot of memory unnecessarily.  var buff [4]byte    _, _ = conn.Read(buff[0:]) allocates.  So does your []byte{"pong"}.  Move that buffers up higher so they're not allocated each time in the loops.

 

* Unnecessary types in the var lines:

 

  var totalPings int = 1000000
  var concurrentConnections int = 100
  var pingsPerConnection int = totalPings/concurrentConnections
  var actualTotalPings int = pingsPerConnection*concurrentConnections

 

.... Should be const anyway.  They don't need types yet.

 

etc

 

chepurnoy

"Go is still new, will likely make performance improvements as it matures and its simple concurrency primitives might make the loss in performance worth it."

Scala concurrency primitives are also not hard at all

paulrkeeble

runtime.GOMAXPROCS(4) is interesting in this case because the Scala version has no such limitation set directly. Since you are using a Mac book pro it is most likely using an i7 CPU, which will have 8 CPUs exposed to the operating system, 4 of which are hyperthreaded. In this case hyperthreading is highly likely to be beneficial because a large amount of the time will be OS and IO based, so the workload will be both very mixed and involving lots of waiting. This alone could account for a large amount of the difference.

 

If I read your results right these tests are taking about 1.5 seconds or so. That isn't sufficient to make any real conclusions, and it's kind of important to ensure the system is warmed up before you run a benchmark. It's also critical that you capture the request/response timing information as well, because the variance could be very high and you would want to know that before making a decision.

outworlder

Why Scala vs Go? Why not try Erlang (maybe with Elixir) instead? 

vmiroshnikov

You can also try something like TinyVM to reduce memory footprint for Scala app. 

hithere2013

go server + go client          22.02125152
scala server + scala client     3.469
go server + scala client        3.562
scala server + go client        4.766823392

 

rzidane360

It would be good to try the go client against the scala server and vice versa as some people have mentioned. Also worth checking how these numbers change with the number of concurrent connections. Also, to ensure real results, maybe the client should check the responses. The scala code is blocking. This is fine for 100 concurrent connections and probably faster to block too. Maybe your custom load balancer has a similar payload pattern of dense messages on short lived connections, but if connections are idle then this may spawn a ton of futures that will block on reading from the input stream.

funny_falcon

I'm pretty sure that Java's getOutputStream and getInputStream return buffered IO objects, while Go's sockets are unbuffered. You should wrap the socket:

 

    buf_conn = bufio.NewReaderWriter(conn, conn)
    var buf [4]byte
    for {
        n, err := buf_conn.Read(buf[0:])
        if err != nil { return }
        if n > 0 {
            _, err = buf_conn.Write([]byte("Pong"))
            if err != nil { return }
        }
    }

 

Jonathan Graehl

@CHF, https://news.ycombinator.com/item?id=6164892 comments say that TCP_NODELAY has the correct default in the latest Go. This was my first thought as well.

 

@Anon, it's very appropriate to measure latency in this way since a client does many reps of [(send request), (wait for and read response)]. buffering (all the way down to the tcp segment level) can only increase latency in this case, since the reads/writes are so small.

CHF

I'd check that the TCP sockets are set for TCP_NODELAY on both sides. The annoying "Nagle algorithm" might be on by default in your system(s), and if so, it will tend to delay small messages. I'd expect a bigger delay than you see if it were on, but it's worth checking, since it's still sometimes enabled, even though it's even more pointless now than when it was invented.

adhominem

Years ago, I did a similar thing in University with Java vs. C: naive implementations of TCP checksum calculations. We found that initial throughput was one order of magnitude better with C than with Java. However, while the C program's performance remained constant, once Java's HotSpot optimizer kicked in, the picture reversed.

 

If you can deal with the startup costs, Java can be pretty damn fast.

RuddO

Now try to make the Scala server reload itself without losing connections with a signal, Goagain style.

 

Good luck.

Dustin

(x-post from HN)

 

Did you consider running the go client against the scala server and vice versa?

 

Also, that's kind of a lot of code. Here's my rewrite of the server: http://play.golang.org/p/hKztKKQf7v

 

It doesn't return the exact same result, but since you're not verifying the results, it is effectively the same (4 bytes in, 4 bytes back out). I did slightly better with a hand-crafted one.

 

A little cleanup on the client here: http://play.golang.org/p/vRNMzBFOs5

 

I'm guessing scala's hiding some magic, though.

 

I made a small change to the way the client is working, buffering reads and writes independently (can be observed later) and I get similar numbers (dropped my local runs from ~12 to .038). This is that version: http://play.golang.org/p/8fR6-y6EBy

 

Now, I don't know scala, but based on the constraints of the program, these actually all do the same thing. They time how long it takes to write 4 bytes * N and read 4 bytes * N. (my version adds error checking). The go version is reporting a bit more latency going in and out of the stack for individual syscalls.

 

I suspect the scala version isn't even making those, as it likely doesn't need to observe the answers.

 

You just get more options in a lower level language.

 

Anon

Did you make sure that the default socket options for send and receive buffering are the same between Go and Scala?

 

If there is buffering, you not only save the cost of locking memory pages for the IO, you also avoid context swaps because the IOs complete successfully right away.

 

Also, you are measuring latency by dividing the total duration by the number of messages. If buffering is on, this will not be a true measurement of latency. You could try using rdtsc to put timestamps in the messages themselves to measure per trip latency, and then average that.

 

I'm not familiar with Scala or Go, so I could be wrong.

 

Cool stuff though, it's always fun to see how microbenchmarks stress the system in different ways.

outworlder

One of the worst comments I've ever seen. Does it physically hurt to be constructive?

rzidane360

@stasiu88 That's a good catch. Array[Byte](4) is not equal to new Array[Byte](4). The first one uses the apply method on the Array object.

Jonathan Graehl

 @boyd_stephen_smith_jr you're right that the Scala in.read may not return all 4 bytes. But I read the InputStream docs just now and it *does* block until at least 1 byte is available. Although the code *is* buggy in exactly the way you point out, I'm confident in this case that it makes no difference (you'll surely receive the whole "pong" or nothing). That said, the correct code would indeed be imperceptibly slower due to the required loop-until-done.

robertmeta

 @chepurnoy But, that doesn't appear to be the issue.  Well -- not the only issue.  Read down the comments... mixing and matching servers and clients gives interesting results, and the go client seems to trend to higher concurrency than the scala client. 

robertmeta

 @hithere2013 Wow, the mismatched servers and clients were a surprise.  I wonder if something is happening odd with running go server + go client on the same machine?

Jack

 @adhominem The Java Virtual Machine is written in C and hints of C++. It's literally impossible for it to exceed C in performance unless you've made some mistakes in the code or the build.

One exception I can think of is a very bad kernel scheduler taking the JVM's huge ram footprint as a sign it should be allocated more CPU time while the C version is so efficient, it's being pushed down the line.

Jonathan Graehl

 @thenaquad Wrong - the "winning" Go version writes all the "ping" at once up front, which is totally different (obviously faster - probably a single TCP segment, for example) from being limited by latency "ping" (wait to receive pong) "ping" .... 

thenaquad

 @DavidLiman Just tried. Didn't help. Looks like one needs to check out socket options. I have serious suspicions that the JVM presets some options by default.

milton baxter

@outworlder right on. No need for such comments as hello's

boyd_stephen_smith_jr

 @Jonathan Graehl That must be a change when they introduced java.nio in 1.4.  Most of my Java sockets programming comes from before then.  I agree that the current docs definitively say, if it doesn't throw an exception, it will block until at least 1 byte is available.  I also agree that the "Pong" will be received whole.

rzidane360

 @adhominem This is an often repeated logic. Yes the Java Virtual Machine is written in C/C++. So what? C++ compilers produce machine code from your source code. If GCC itself was written in Python the code generation would be possibly slower but would your compiled binary be any different? The JVM JIT also produces assembly code, not C code. Why is it theoretically not possible for the JIT compiler to produce more optimized code than any other C++ compiler say G++ or Clang? There are quite a few optimizations applicable at run time not compile time.

 

I am not actually saying that JIT-ted Java is actually faster than finely tuned C++ compiled with a good compiler, but the reason it's not is not because the JVM was written in C++ itself.