Flower Filter for Key-Value Events

In my last post, I presented a simple time-decaying approximate membership filter. In this post I would like to extend it for key-value events.

Flower Filter is a membership filter. It can answer, with time-decaying accuracy, whether a particular event happened recently. When events are key-value pairs, the query is whether a particular key-value pair occurred recently. We will attempt to identify variations of Flower Filter that capture some semantics of key-value data. More specifically, we consider the following three scenarios:

  1. Remembering the latest value for a key
  2. Remembering a few recent values for a key
  3. Remembering a few recent values with rough frequency estimates

Now I will go over how these scenarios can be implemented with modified filters. I am not going to provide deep mathematical analysis here—just an intuitive explanation.

Remembering the latest value for a key

It is quite easy to change the filter to remember one value per key. In the basic filter, we used hash functions to compute locations in the array, and we use another set of hash functions to compute fingerprints that are written to the locations. If we apply this method to <K, V> pairs, different pairs are hashed into different locations. We will end up remembering many values for the same key. What we want here is to remember a single value per key. In this case, hashing the keys to compute locations is all we need to do. Therefore, the hash functions should take only a key as an input, [ locationi = hi(K) ]. How about fingerprints? One way is to have the fingerprint functions take only a value as an input, [ fingerprinti = fi(V) ]. If the cardinality of V is low, there will be a problem. For example, think about boolean values. There will be collisions everywhere! It is better to have the fingerprint functions take both K and V, [ fingerprinti = fi(K, V) ]. With this setup, old values for the same key are always removed from the filter.

Remembering a few recent values for a key

A recursive application of the method will give us this functionality. A nested filter for a given key K is formed by all locations designated by the hash functions [ locationi = hi(K) ]. Interestingly, decay is caused globally by variations of keys and locally to the nested filter by variations of values. We can decide parameters for this nested filter according to the application requirements.

Remembering a few recent values with rough frequency estimates

With a simple modification, the nested filter can give a very rough estimate of the frequencies of recent values. In the basic filter, the number of matching fingerprints has a strong correlation with the age of an event. We are going to change this behavior a bit. The number of matching fingerprints will reflect the frequency if we write fingerprints to more locations for frequent values. We will do that by randomizing the locations within the nested filter. The intuition behind this is very simple: a frequent value has more chances to write fingerprints to different locations before overwritten by other values.

In this blog post, I presented interesting modifications to Flower Filter. Flower Filter is very simple, yet it has many utilities. Although its accuracy may not be great, it definitely conveys approximate information in a usable manner.

Creating an async consumer Iteratee in Play!

The Play! Iteratee framework is very nice for reactively handling data streams. If you haven’t worked with Iteratees yet, think of them as consumers/sinks with a state. For example, in the Chat room example provided with the Play source, Iteratee.foreach imperatively consumes new messages, passing them to an actor to process.

// Create an Iteratee to consume the feed
val iteratee = Iteratee.foreach[JsValue] { event =>
  default ! Talk(username, (event \ "text").as[String])
}.mapDone { _ =>
  default ! Quit(username)
}

There’s an interesting subtlety that could be missed in this example. Iteratee.foreach internally uses fold and simply discards the function’s result on each step. This may be surprising — in Scala collections, it’s the other way around, fold is implemented with foreach — but makes sense, since an Iteratee is a state machine. A side effect of this implementation is that foreach still steps through input one frame at a time, even though the result is always a Unit:

def foreach[E](f: E => Unit): Iteratee[E, Unit] = fold[E, Unit](())((_, e) => f(e))

So, if you’re dealing with a high volume of data into one Iteratee, or processing a frame is not immediate, you may be surprised to find blocking behavior.

If you’re always side effecting, an easy solution is to immediately offload the request to an actor system (preferably multiple actors to concurrently process the data).

// If you have a consumer like this:
val iterateeBlocking = Iteratee.foreach[JsValue] { data =>
  process(data)
}
// Here are some alternatives:
val iterateeFuture = Iteratee.foreach[JsValue] { data =>
  Akka.future { process(data) }
}
val iterateeActor = Iteratee.foreach[JsValue] { data =>
  actor ! data
}
// or to understand what's happening under the covers, or
// if you'd prefer to work with the state machine directly:
import play.api.libs.iteratee._
private def asyncIteratee[T](f: T => Unit): Iteratee[T, Unit] = {
  def step(i: Input[T]): Iteratee[T, Unit] = i match {
    case Input.EOF => Done(Unit, Input.EOF)
    case Input.Empty => Cont[T, Unit](i => step(i))
    case Input.El(e) =>
      Akka.future { f(e) } // or another async handler
      Cont[T, Unit](i => step(i))
  }
  (Cont[T, Unit](i => step(i)))
}
val iterateeAsync = asyncIteratee[JsValue] { data =>
  process(data)
}

If the last bit is confusing, I encourage you to read the Iteratee.scala source. An Iteratee is in one of three states: Done, Cont, or Error. New Input is either EOF, Empty, or El(value). With EOF, we return an Iteratee that is forever in the Done state. With Empty, we’re happy to continue accepting input by returning an Iteratee in the Cont state. When we have input with El(value), we asynchronously process it, and return the an Iteratee in the Cont state.

This way, the Iteratee can immediately process more input. This is especially appropriate when your data is arriving in quick bursts, so each input can be processed independently to minimize the time between arrival and the end of processing. Obviously, this breaks FIFO and any ability for input (besides EOF) to affect the state. For most Iteratee.foreach uses that I’ve seen, however, processing input asynchronously will greatly minimize average input processing time.

Working with JSON in Play 2.1

Play 2.1 now provides a really nice library for formatting Scala objects as JSON. As we’ve migrated from Play 2.0 to Play 2.1, we’ve been taking advantage of several of its features to reduce the size and complexity of our code.

To demonstrate how we use it, let’s start with a simple example. Suppose I have a Person object that I want to both serialize and deserialize:

case class Person(
  id: Option[Long] = None, firstName: String, lastName: String, birthDate: LocalDate)

Here we use the macro-based Json.format to create a reader and writer for the Person class:

import play.api.libs.json._
 
implicit val personFormat = Json.format[Person]
 
Json.toJson(Person(Some(1), "Arthur", "Dent", "arthur@example.com", new LocalDate("1952-03-11")))
// => {"id":1,"firstName":"Arthur","lastName":"Dent",birthDate":"1952-03-11"}

But in our codebase things are a bit more complicated. We like to have type-safe IDs, and might use a Name wrapper for the first and last name, so our person might actually look like this:

case class Id[T](id: Long) { override def toString = id.toString } // in Id library
 
case class Name(firstName: String, lastName: String) {
  def firstLast = s"$firstName $lastName"
  def lastFirst = s"$lastName, $firstName"
}
case class Person(
  id: Option[Id[Person]] = None, name: Name, email: String, birthDate: LocalDate)

In that case we have a helper to generate our ID serializer for each ID type, and generate the Name serializer using Json.format

def idFormat[T]: Format[Id[T]] = Format(
  __.read[Long].map(Id(_)),
  new Writes[Id[T]]{ def writes(o: Id[T]) = JsNumber(o.id) })
 
implicit val personIdFormat = idFormat[Person]
implicit val nameFormat = Json.format[Name]
implicit val personFormat = Json.format[Person]
 
Json.toJson(Person(Some(Id(1)), Name("Arthur","Dent"), "arthur@example.com", new LocalDate("1952-03-11")))
// => {"id":1,"name":{"firstName":"Arthur","lastName":"Dent"},"email":"arthur@example.com","birthDate":"1952-03-11"}

You can see that we need to be a bit more verbose when creating formatters for single-element case classes because of limitations in the Play API.

Using functional combinators

Sometimes we want the external JSON to be structured differently than our Scala code. Suppose we want to place the firstName and lastName properties directly in the JSON for a Person and want to validate the email address format. Then we can use the functional combinators to construct our Format[Person]:

import play.api.libs.functional.syntax._
 
implicit val personFormat: Format[Person] = (
  (__ \ "id").formatNullable[Id[Person]] and
  ((__ \ "firstName").format[String] and
    (__ \ "lastName").format[String])(Name.apply, unlift(Name.unapply)) and
  (__ \ "email").format(Reads.email) and
  (__ \ "birthDate").format[LocalDate]
)(Person.apply, unlift(Person.unapply))
 
Json.toJson(Person(Some(Id(0)), Name("Arthur","Dent"), "arthur@example.com", new LocalDate("1952-03-11")))
// => {"id":0,"firstName":"Arthur","lastName":"Dent","email":"arthur@example.com","birthDate":"1952-03-11"}
 
Json.fromJson[Person](Json.parse("""{"id":0,"firstName":"Arthur","lastName":"Dent","email":"arthur@example.com","birthDate":"1952-03-11"}"""))
// => JsSuccess(Person(Some(Id(0)),Name(Arthur,Dent),arthur@example.com,1952-03-11),)
Json.fromJson[Person](Json.parse("""{"id":0,"firstName":"Arthur","lastName":"Dent","email":"example.com","birthDate":"1952-03-11"}"""))
// => JsError(List((/email,List(ValidationError(validate.error.email,WrappedArray())))))

Notice we used the provided Reads.email to validate the email (along with the implicit string writer). In practice, we might use an Email value class wrapper and write a serializer for that, but I’ll leave that as an exercise for the reader.

Backwards compatibility

If we’ve decided to store our JSON anywhere and then change the serializer, we might run into backwards-compatibility issues when reading back the old JSON. There are a couple of ways to deal with this problem. One is to keep your old reader around and define a new reader using orElse:

val personFormatOld: Format[Person] = ??? // old serializer
val personFormat: Format[Person] = ??? // new serializer
implicit val personReads: Reads[Person] = personFormat orElse personFormatOld
implicit val personWrites: Writes[Person] = personFormat

We can also define default values for properties we added. Let’s say I just added the email property:

implicit val personFormat: Format[Person] = (
  (__ \ "id").formatNullable[Id[Person]] and
  (__ \ "name").format[Name] and
  (__ \ "email").formatNullable(Format(Reads.email, Writes.StringWrites))
    .inmap[String](_.getOrElse(""), Some(_).filterNot(_.isEmpty)) and
  (__ \ "birthDate").format[LocalDate]
)(Person.apply, unlift(Person.unapply))
 
Json.fromJson[Person](Json.parse("""{"id":0,"name":{"firstName":"Arthur","lastName":"Dent"},"birthDate":"1952-03-11"}"""))
// => JsSuccess(Person(Some(0),Name(Arthur,Dent),,1952-03-11),)

Note that we can use Format#inmap (or Reads#map/Writes#contramap) to do a transformation between the Option[String] we get from formatNullable and a String with a default value.

Formatting DateTimes

It’s also nice to have DateTime values printed as human-readable strings (as is the case for LocalDate shown in our example). We can easily define a formatter to do this. The format pattern used is the same as in SimpleDateFormat.

import play.api.libs.json._
 
val pattern = "yyyy-MM-dd'T'HH:mm:ssz"
implicit val dateFormat =
  Format[DateTime](Reads.jodaDateReads(pattern), Writes.jodaDateWrites(pattern))
 
Json.toJson(new DateTime("1952-03-11")) // "1952-03-11T00:00:00PDT"

Conclusion

As you can see, Play 2.1 makes it really easy to read and write Scala case classes as JSON. We can create simple serializers using the macro-based approach, and deal with more complex cases using the functional combinators.

For more comprehensive information, you might find Play’s documentation on JSON combinators and handling JSON requests to be helpful.

Flower Filter – An Update

A few days ago, Yasuhiro published an article on Flower Filter.

The main motivation of Flower Filter is that we need a data structure that remembers newer data with higher probabilities and older data with lower probabilities. We call such probability “survival probability”. In this post, we discuss some math subtlety of the survival probability.

First, let’s have a quick review of the background. We have an integer array A of size s, and A stores the fingerprints of events. When an event E comes, we use n hash functions to compute n location indexes, and E stores its fingerprints in these locations. (Due to hash collisions, may take fewer than n locations in array A. But let’s ignore this for now.) Suppose t events come after event E. These events may overwrite event E’s fingerprints. We say event E survives if at least one of its n fingerprints is intact.

In the original article, the survival probability Φ of event is computed as:

wrong

Here, p is the survival probability of one fingerprint of event E, and it is computed as:

eq3

The summands in the summation are binomial probabilities. Thus, an implicit assumption is: the survival of different fingerprints of one event are independent. If you are really picky about math, you will find this assumption is not valid.

A simple counter-example: consider n=2, s=4, t=1. In other words, array A can hold up to 4 fingerprints, and each event can store up to 2 fingerprints. Suppose initially an event E stores two fingerprints in slots 1 and 2. We need to compute the survival probability of E if there is one more event (since t=1).

If the survival of slot 1 and slot 2 were independent, then the conditional probability p(slot 2 intact given that slot 1 intact) should equal p(2 intact). But,

not-indepedent

Thus, intactness of 1 and 2 are not independent.

To get around this dependency problem, we can use the inclusion-exclusion principle. Let Xi denote the case that the i-th fingerprint of event E is intact after t events. We have

correct-survival-prob

Clearly, this looks very different from the original formula. The question is, in reality, does this matter much?  The answer is no, and we hope the plot below can convince the reader. Here, the green curve represents the new formula, the bule curve for the old formula.

flower-filter-compare

As we can see, the difference is negligible. In fact, in our system, n<<s, so the difference is even smaller.

The new formula assumes that event E stores exactly n fingerprints in array A. If you are picky, you can find the total probability by marginalizing over all cases (E stores exactly i fingerprints, i = 1, …, n). But we believe the difference is marginal.

Conclusion: In the original Flower Filter blog post, we implicitly assumed that the survival probabilities of an event’s different fingerprints are independent. In this article, we showed that that assumption does not hold, derived a new formula to compute the survival probability of an event, and demonstrated that the difference is negligible in practice.

Doing real-time communication? Secure WebSockets vs HTTPS Benchmarks

We benchmarked round-trip times with Secure WebSockets (wss://) vs HTTPS to see if there was a noticeable difference in response speed. For Secure WebSockets, we timed the difference between the client sending a “ping” packet, and the server responding with “pong”. This included going through an Akka router that managed the Play! Iteratees. For standard HTTPS, we timed the round-trip of the client initiating a request to an endpoint that immediately replies with “Ok”. The sample size is 1000 per type.

WSS vs HTTPS benchmark

The median improvement using WebSockets was 3ms, which is slightly misleading due to the bimodal distribution of response times. On average (mean), we saw a 13ms round-trip improvement using WebSockets.

A few caveats:

  • If you need to support a wide array of browsers, you will need to implement WebSockets along side other real-time techniques like long-polling, since IE doesn’t support WebSockets until version 10.
  • This is only testing one aspect of real-time communication: client to server to client. For anything server to client, long polling already establishes a connection as well, so WebSockets should not have any advantage.
  • HTTPS overhead is almost completely in the initial handshake. For the standard HTTPS requests, we configured the keep-alive so that the test is as fair as possible. For both, the initial handshake will take a few multiples of the average ping time.
  • Since the WebSocket requests went through our Akka WS router, there was a slight additional overhead.
  • We happen to be very close to our EC2 data center, so ping times for both are low. If your users are far from your nearest server, then this improvement may be completely negligible.

Building nginx to use WebSockets

With our Play! applications, fast real time server ↔ client communication is very important. Fortunately, Play! has a great API for handling data reactively with Iteratees, Enumerators, and Enumeratees.

Fortunately, nginx, which is often used as a proxy to Play!, just released WebSocket support in version 1.3.13. It’s still a development release, so many package managers don’t have a build yet.

To build Nginx 1.3.13, grab the source from the nginx download page. Be sure to make backups of your current nginx configurations, as the make install later will overwrite them. On our EC2 instance, we made backups of /etc/nginx/ and /etc/init.d/nginx.

If you would like to change the display name of your server, now’s a great time. Edit src/http/ngx_http_header_filter_module.c lines 48 and 49 to be whatever you want. Next, run ./configure in the nginx source directory. You can set several preferences here, such as config locations, log paths, etc. To keep yours the same as before, consult the Install Options page in the nginx wiki. We used ./configure --with-http_ssl_module --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --conf-path=/etc/nginx --pid-path=/var/run/nginx.pid --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --user=nginx.

Next, issue make, then shut down your nginx server with service nginx stop on many distributions. If you were previously using a package manager’s build, you can safely uninstall it now. Run make install, then replace your old 1.2.x configuration files, and bring the server back up. If all’s good, you should be back up and running with nginx 1.3.13.

To enable websockets, add the following to your approperiate server configs (setting your websocket path correctly):

  location /ws {
    proxy_pass http://yourServerAlias;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
  }

Flower Filter – A simple time-decaying approximate membership filter

In this post, we would like to talk about a simple data structure we designed for an approximate membership filter with probabilistic time decay. It doesn’t sound simple, but the filter is easy to implement and is highly concurrent while achieving a very low memory overhead.

In our system there are a lot of incoming events. The system needs to detect recurring events and apply a special processing to them. Some people may think of using LRU cache for that task, but we don’t use it. Why? If the accuracy matters, LRU is a good way, but for us the accuracy is not the highest priority. Our data structure needs to remember newer data with a higher probability and older data with a lower probability. In addition, we are going to have a large number of filters in the system, so a low memory overhead is required. Lastly filters are serialized/deserialized frequently, and they need to be done fast. Therefore, we looked for a different solution than LRU, and we came up with the data structure in this blog.

Let’s start with the most naive way. Suppose an event is an integer number E, and we have an integer array of size s. For each event, the hash value is computed to locate the position in the array, and E is written in the position. An older event is more likely to be overwritten by newer events. The probability of survival of an event is

eq1

t represents the number of events after this event is inserted. This is an exponential decay function.

mhf-graph0

Although this works, in a real environment a hash collision may occur. If this happens among frequent events, they overwrite each other. Therefore we need to reduce the collision rate somehow.

How about having more hash functions? Instead of writing to a single place in the array, we write the value to multiple places using multiple hash functions. The probability that all hash values collide should be lower.  Let n be the number of hash functions. An event survives when at least one out of n places is not overwritten. How does the survival probability change? For brevity, we will gloss over the fact that collisions may occur among the hash values of a single event.

Every event writes to n places in the array. An event survives if at least one out of n places is intact. The survival probability φ of an event after t event insertions is

eq2

where

eq3

The following graph shows a few examples of different n (s is fixed to 1000).

mhf-graph1

With a higher n, the probability declines slowly at the beginning, which means that the filter remembers the recent events very well. The curve becomes closer to a step function as n increases. It is obvious, however, that the overall capacity of remembering events decreases as n gets bigger.

In order to compensate the capacity decrease let’s consider increasing s. In the following graph, s is increased proportionally to log2(n).

mhf-graph2

The capacity is still decreasing, but the size of log2(n) order seems to be a reasonable guide line.

We would like to use a bigger n, but the space efficiency is an issue. It will be great if we can decrease the required storage size without decreasing the capacity. Let’s try an approximate matching instead of the exact matching and examine how the approximation affects the accuracy.

We will reduce the bit size of an array entry. Another hash function is used to map the 32-bit integer to a smaller number of bits. Let’s call it a fingerprint and let b be the size of a fingerprint. Intuitively the probability of false positive f at a single place is

eq4

Considering this, the probability φ, which is now the probability that the filter answers “YES” including false positives, is now written as

eq5

where

eq6

In the next graph, the probabilities are shown with a few variations of n. b is fixed to 8.

mhf-graph3

Now the curves don’t approach to zero. The effect of false positive is surprisingly high for a big n. We cannot choose a small b when we use a large n. In my opinion, practical choices of parameters are { b = 8, n < 4 } and { b = 16, 4 ≤ n ≤ 32 }.

We designed a very simple data structure for an approximate membership filter with probabilistic time decay. We have three parameters (n, b and s) to configure according to application’s needs. The filter is easy to implement and is highly concurrent since there is no pointers to update. Most importantly, it has a very low memory overhead since it consists of just one array object.

Pattern matching in Scala 2.10

One of the improvements that came with Scala 2.10 is completely rewritten and improved pattern matching. In addition to fixing some performance issues (e.g. this exponential space bytecode bug), it fixes a few of the annoyances I’ve had with pattern matching in Scala and gives much more useful errors at compile time.

Better matching on sequences

Now we can match on Seq more like we match on lists using +: and :+. For example:

def sort[T <% Ordered[T]](seq: Seq[T]): Seq[T] = seq match {
  case Seq() => Seq()
  case x +: xs =>
    val (before, after) = xs partition (_ < x)
    sort(before) ++ sort(x +: after)
}
sort(IndexedSeq("Greg", "Eishay", "Andrew")) // => List(Andrew, Eishay, Greg)

Since it’s preferred to use the Seq type over List in most places, this should probably be preferred over :: when possible.

Note that xs :+ x captures the last element of the list. You should generally only do this with an IndexedSeq or some other collection that can get the last element efficiently.

Compile-time checking of type arguments

In 2.9, type arguments would not be checked when destructuring many types. For example, the following code compiles in 2.9:

val (a, b): (Int, Int) = ("a", "b")
val List(a, b, c): List[Int] = Some("hello")

The compiler would issue a warning, and then we’d get a ClassCastException when running the code:

val (a, b): (Int, Int) = ("a", "b")
:7: warning: non variable type-argument Int in type pattern (Int, Int) is unchecked since it is eliminated by erasure
       val (a, b): (Int, Int) = ("a", "b")
                   ^
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
...

In 2.10, the type checker is smarter, and we get a nice message at compile time:

scala> val (x, y): (Int, Int) = ("a", "b")
:7: error: type mismatch;
 found   : String("a")
 required: Int
       val (x, y): (Int, Int) = ("a", "b")

Abstract types

We can also now match on abstract types. So we can do stuff like:

scala> import reflect.ClassTag
import reflect.ClassTag
scala> def filterBy[T: ClassTag](seq: Seq[Any]) = seq collect { case x: T => x }
filterBy: [T](seq: Seq[Any])(implicit evidence$1: scala.reflect.ClassTag[T])Seq[T]
scala> filterBy[String](Seq(1, "two", "three"))
res0: Seq[String] = List(two, three)

Note that, unfortunately, we can’t match value types like Int when they are boxed (we have to match the Java wrapper type).

Speed up your Play! controller tests 100x

In the previous post, I explained how to use Guice with Play! 2.1 for easy dependency injection. At the end, I demonstrated how to test Play! controllers directly, without creating a FakeApplication. This is great for testing actual controller functionality, without the overhead of loading a new Play! application. It also works very well with Play 2.1′s constructible controllers, allowing you to explicitly pass in class instances that you use.

The performance boost is quite significant. I actually didn’t mention how much direct controller tests speed up your unit tests, so I created a simple benchmark using the Play! 2.1 Guice Example application.

I ran the 3 tests from that project (normal Play! tests, explicitly injected controller, and direct controller instance) 10,000 times, and took the median time (measured in nanoseconds, displayed here in milliseconds).

def timer(f: => Any): Long = {
  val startTime = System.nanoTime
  val ret = f
  val endTime = System.nanoTime
 
  endTime - startTime
}
 
def iterTest(f: => Any): Double = {
  (for(i <- 0 to 10000) yield timer(f)).sorted.apply(5000)
}
 
val norm_time = iterTest {
  running(FakeApplication()) {
    val translated = route(FakeRequest(GET, "/greet/Barney")).get
 
    contentAsString(translated) must contain ("Barney")     
  }    
}
println(s"Normal Play! way:\t$norm_time")
 
val inj_time = iterTest {
  running(FakeApplication(withGlobal = Some(FakeTranslatorGlobal))) {
    val home = route(FakeRequest(GET, "/greet/Barney")).get
    contentAsString(home) must contain ("Hello Barney")
  }
}
println(s"Injected:\t\t$inj_time")
 
 
val direct_time = iterTest {
  val controller = new Translate(new FakeTranslator)
  val result = controller.greet(name = "Barney")(FakeRequest())
  contentAsString(result) must contain ("Hello Barney")
}
println(s"Direct:\t\t\t$direct_time")

The results? Direct controller tests performed two orders of magnitude faster than standard Play! tests. Normal Play! controller testing took 5.52ms, the explicitly injected controller took 4.51ms, and the direct controller call took 0.09ms.

This result is intuitive (much less overhead), but many large Play! applications fail to test this way. For a well-tested codebase with thousands of tests, this can shave several minutes off each deployment. In conjunction with dependency injection, this removes code and time bottlenecks to having a very well tested codebase.

Using Guice with Play! Framework 2.1 for easy Dependency Injection

One of Play! Framework 2.1‘s latest features is the ability to construct controllers, so you can use dependency injection. Briefly, the benefits of using dependency injection is that you will have more modular, extensible, flexible, and testable code. At FortyTwo, we use Guice (a lightweight DI framework by Google) for many of our system components. I’ll show you how to easily integrate Guice into an example Play! 2.1 Scala project and then show you how this greatly improves your codebase’s flexibility.

TL;DR:

At its heart, dependency injection allows you to say something like “I’d like a DbConnection instance”, and not have to worry about where it comes from. Additionally, you can configure exactly which DbConnection to provide elsewhere. So, you can write your component very simply by injecting a DbConnection, and centrally configuring which implementation to use in which conditions. For development and testing, you may want to use an in-memory database, and for production, use a “real” database (example gist).

Integrating Guice with Play! Framework 2.1

For this example, we’ll create and test a very simple translation component. We’ll use a library called sse-guice to simplify Guice syntax, allowing us to use bind[Service].to[ServiceImpl].in[Singleton] instead of bind(classOf[Service]).to(classOf[ServiceImpl]).in(classOf[Singleton]). Ah, much prettier.

First, add Guice and sse-guice to your Build.scala app dependencies:

  val appDependencies = Seq(
    "com.google.inject" % "guice" % "3.0",
    "com.tzavellas" % "sse-guice" % "0.7.1"
  )

We’ll create a simple Translator trait, which will take an input String, and output its translation. For this demo, we’ll implement it with simple word-replacement for French.

trait Translator {
  def translate(input: String): String
}
 
class FrenchTranslatorImpl extends Translator {
  val wordReplacements = Map(
    "hello" -> "bonjour",
    "hi" -> "salut",
    "greetings" -> "salutations"
  )
  def translate(input: String): String = {
    input.split("""\s+""").map(word => wordReplacements.get(word.toLowerCase).map(newWord => word match {
        case _ if word(0).isUpper => newWord.capitalize
        case _ => newWord
      }).getOrElse(word)).mkString(" ")
  }
}

In a real application, you might implement translate() by calling a 3rd party API. Either way, we’ll treat this as something that’s blocking or is particularly slow. For testing, you might not be interested in the actual translation, but how parts of your application work with translated strings. So we’ll make a FakeTranslator (which just returns the input):

class FakeTranslator extends Translator {
  def translate(input: String): String = input
}

Next, we’ll define two modules for different circumstances, DevModule (development and testing) and ProdModule (for production). Whenever we need a translator in production, we’ll use the FrenchTranslatorImpl. Otherwise, we’ll use our more efficient FakeTranslator.

package common.modules
 
import com.tzavellas.sse.guice.ScalaModule
import common.translation._
 
class ProdModule extends ScalaModule {
  def configure() {
    bind[Translator].to[FrenchTranslatorImpl]
  }
}
 
class DevModule extends ScalaModule {
  def configure() {
    bind[Translator].to[FakeTranslator]
  }
}

Let’s create a new controller to use our (very rudimentary) translator.

package controllers
 
import play.api._
import play.api.mvc._
import common.translation.Translator
import com.google.inject._
 
@Singleton
class Translate @Inject()(translator: Translator) extends Controller {
  def greet(name: String) = Action {
    Ok(views.html.greet(translator.translate(s"Hello $name!")))
  }
}

You’ll notice a couple things are different than usual. Normally, Play! controllers are singleton objects, which cannot take parameters (so sadly, can’t be given injected classes). Here, we make Translate a class, and annotate it with @Inject(), which tells Guice to inject parameters. Next, we simply ask for a Translator instance. Notice we don’t ask which translator that we need.

The real magic happens in Global.scala, in a new method to Play 2.1 called getControllerInstance, which returns an instance of a requested controller (with Guice voodoo applied!). Additionally, we create the injector based on which mode Play is running in.

object Global extends GlobalSettings {
  private lazy val injector = {
    Play.isProd match {
      case true => Guice.createInjector(new ProdModule)
      case false => Guice.createInjector(new DevModule)
    }
  }
 
  override def getControllerInstance[A](clazz: Class[A]) = {
    injector.getInstance(clazz)
  }
}

Finally, to let Play! know to use getControllerInstance, we annotate the route with an @:

GET /greet/:name   @controllers.Translate.greet(name: String)

Now, when we start Play! in development mode, we get:

And in production mode:

So what? Why should I use DI with Play?

Notice that we needed no factories and were able to keep our code very nicely organized. We can now easily swap out Translator implementations with no mass code replace. But here’s a really cool benefit: Previously, testing Play! controllers was a lot harder and didn’t let us dynamically inject classes into controllers. We couldn’t easily swap out a HTTP client, email sender, or clock. However, now, we can very easily mock out classes when testing our controllers.

class TranslateSpec extends Specification {
 
  "Translate" should {
    // The normal Play! way
    "accept a name, and return a proper greeting" in {
      running(FakeApplication()) {
        val translated = route(FakeRequest(GET, "/greet/Barney")).get
 
        status(translated) must equalTo(OK)
        contentType(translated) must beSome.which(_ == "text/html")
        contentAsString(translated) must contain ("Barney")     
      }
    }
 
    // Providing a fake Global, to explitly mock out the injector
    object FakeTranslatorGlobal extends play.api.GlobalSettings {
      override def getControllerInstance[A](clazz: Class[A]) = {
        new Translate(new FakeTranslator).asInstanceOf[A]
      }
    }
    "accept a name, and return a proper greeting (explicitly mocking module)" in {
      running(FakeApplication(withGlobal = Some(FakeTranslatorGlobal))) {
        val home = route(FakeRequest(GET, "/greet/Barney")).get
        contentAsString(home) must contain ("Hello Barney")
      }
    }
 
    // Calling the controller directly, without creating a new FakeApplication
    // (This is the fastest)
    "accept a name, and return a proper greeting (controller directly, no FakeApplication!)" in {
      val controller = new Translate(new FakeTranslator)
      val result = controller.greet(name = "Barney")(FakeRequest())
      contentAsString(result) must contain ("Hello Barney")
    }
  }
}

The first way is the typical way Play! controllers are tested. The second is using a custom Global, which explicitly sets the getControllerInstance to return a FakeTranslator instance. Using this pattern, you can easily mock classes that you are testing around. A common use-case is mocking a HTTP client with a class that gives a pre-defined response. Finally, the 3rd method avoids starting a FakeApplication entirely, and instead calls the controller directly with our FakeTranslator.

Conclusion

Play! 2.1 has opened the door for developers to better organize and more thoroughly test their code using dependency injection. Because we can now construct controllers with injected classes, controllers can be more thin and tests are significantly faster. Check out the Github repo for this example, which you can use to bring Guice to your own projects.