Wednesday, October 29, 2008

How Big is an EObject?

I've been thinking a lot the last while about how best to reduce the footprint for a "bare" EObject. The most basic implementation we have is BasicEObjectImpl. It's the abstract base class that all modeled objects should extend; we can and do add methods to InternalEObject so extending this ensures binary compatibility. While BasicEObjectImpl declares no fields, it implements all reusable logic in a highly factored way. EObjectImpl extends it and declares fields for the commonly used features, leaving a properties holder field to point at an instance that in turn declares fields for the less commonly used features. Unfortunately, calling eContents() and eCrossReferences() causes this holder to be fluffed up. Of course derived classes can and do specialize that to instead always create a new list---these lists are just views---and thereby avoid this, but the general problem is that once the properties holder is allocated, it sticks around. For example, when you turn an object into a proxy while unloading a resource, it shows up to hold the proxy URI. Let's take closer peek.


Here's a bit of simple math, based on a 32 bit JVM, for the memory footprint of a bare fully fluffed up EObjectImpl. An object with no fields has 8 bytes of heap overhead. Each int field and each non-primitive field has 4 bytes of overhead. As such, EObjectImpl with its 5fields has minimally 8 + 5 x 4 = 28 bytes. Once you call EObject.eAdapters(), the adaper list is fluffed up. It uses a relatively smart implementation that has no backing array until there is at least one element in the list, but still it has 8 + 4 x 4 = 24 bytes of overhead. Now call EObject.eContents() and EObject.eCrossReferences(). Originally it was assumed that not many clients would use these but in actual fact they ended up being exceedingly powerful and heavily used in the famework itself and by clients. After the calls, the properties holder is fluffed up so add 8 + 6 x 4 = 32 bytes along with, for each list, 8 + 3 x 4 = 20 bytes. Recall the lists are views, so they never change their footprint regardless of the size. Add that all up, and we're at 124 bytes. To put that in perspective, that's as much space as used by a 44 character-length String. Of course I've already pointed out that it's trivial to reduce this to 76 bytes by not caching the contents and cross references lists, but that's still a lot of nuts to stuff in those little cheeks.


I've been thinking a lot about what's the best possible thing we could do to reduce this and came up with the design that's prototyped in 252501. If you want to store data on the object itself---maps could be used, but that will make the footprint worse, not better---minimally you'd need at least one field. That field would reference a structure that holds only the necessary fields. You could trivially use an array, but casting is very expensive. You'd be shocked if you measured it. Unfortunately most people don't measure performance of micro operations and simply assume something that looks trivial is also trivially cheap. Wrong. An array is also not ideal for storing primitives. Ideally you'd want something from which you could fetch the data without down casting from Object. In any case, fewer trips to empty out the cheeks would be good.


Imagine instead having a class for each combination of fields you require. You'd need quite a few of them, unfortunately, but that would be space optimal while also avoiding casting. When you needed to set a field, you'd create whatever class instance is required to hold that field along with the other fields already set, or, when unsetting a field, you'd create a smaller instance without that field. The thought of writing all those classes is unbearably stupid, so of course generating them is the order of the day; work smarter, not harder. That way you can easily modify the pattern without the mind numbing tedium of writing 2^6 classes by hand. I'm pretty happy with the result, though not so happy with the 75K it adds to the jar. I need to think if there are ways to reduce the amount of byte code...

Some cool things that came out of this is the fact that the eDeliver flag, i.e., whether or not the notifications will be produced, does not require any storage; the implementation class simply returns true or false from its hard coded method as appropriate. The patch in the bugzilla shows the templated used to generate this and CompactEObjectImpl shows the template's result as well as how it's used to fully implement a modeled object. Note that I hacked DynamicEObjectImpl to use this new base class only so I could run the core test suite to verify CompactEObjectImpl's correct behavior. It's a good feeling to know all is well.


In combination with this storage approach, I've also created a new array-backed list implementation that avoids ever modifying the backing array; it allocates an array of exactly the required size and never modifies it after populating it correctly. Now, instead of caching a list implementation for the eAdapters, I can cache only the array of adapters themselves, which of course is null for an empty list of adapters. Add to that the trick of checking, when the adapters array is replaced with a new one, if that replacement is equal to the one held by the container, and caching the container's instance instead, and we've ensured that a tree of objects, where each object has the same list of adapters, uses a single shared array instance. This is particularly valuable for ChangeRecorder, EContentAdapter, ECrossReferenceAdapter, and any adapter, like UML's CacheAdapter, that uniformly adds itself to the entire tree of objects. Are you all perked up for the final result?


If we do the same calculations for this implementation, a fully fluffed up CompactEObjectImpl takes only 8 + 1 x 4 = 12 bytes of storage. I think that's impossible to beat. I know of a large company that will be very happy with this. Isn't open source grand? If we could just reduce the byte code for all those darned storage subclasses, without paying the cost of casting, I'd be ecstatic. Anyone out there with creative ideas? A contribution would make a nice birthday present. Speaking of which, happy birthday Darin.

8 comments:

Mike Norman said...

It sounds like you have run into some of the issues that the 'dynamic language guys' (JRuby, Scala) have hit w.r.t. class generation:
http://blog.headius.com/2008/09/first-taste-of-invokedynamic.html

The stuff about java.dyn.AnonymousClassLoader and MethodHandles just blows my mind!

Ismael Juma said...

Hi Ed,

You say that casting is very expensive. I am interested in how you benchmarked this. From my measurements, casting is actually relatively cheap (and HotSpot has intrinsics for the relevant operations).

On the other hand, boxing primitives to Objects can be relatively expensive. Is that perhaps what you measured? For example, if you were trying to add an int to an Object[], then indeed it's not surprising that it would be costly.

Ismael

Ed Merks said...

Ismael,

Relatively cheap is a relative term. When compared to calling a method or returning a field, it's relatively expensive. In the past, I found that casting on the IBM JRE was much much slower than the Sun JRE, but then significant improvements were made to the IBM JRE which brought them more in line. Many factors are involved. Casting to a class is cheaper than casting to an interface. The number of classes in the hierarchy even has an impact. Once you cast to something, the information about that cast is reused, making the next cast to the same thing cheaper (in some implementations). And of course each time a new JRE comes out, your assumptions might have to change again!

Typically I measure by running sample code in a tight loop and running it a huge number of times so that it has macroscopic cost. It's important that the code be warmed up so that the JIT has done its thing. And you have to be careful that the JIT isn't so smart that you're not actually measuring what you think you're measuring, but something quite a bit reduced.

Your comments are a good reminder that I should be sure to measure a more trivial array of slots with casting implementation against the one that's presumably faster because it avoids casting. I can always use bit flags to store the container feature ID, the eDeliver, and the mask for the form of the array of slots all in a single int...

Ismael Juma said...

Indeed Ed, it is because I am familiar with all of the JVM benchmarking pitfalls that I asked for specifics. :)

Last time I tested casting to an interface in an inner loop with JDK6u10, the difference was in the noise compared to no casting. This was with -server and the appropriate warm-up. So I was curious to see what exact scenario you had tested (in case you still had the code handy) to understand why the results were different.

Ismael

Ed Merks said...

Ismael,

You're comment though makes it sound like casting is effectively free, but that can't really be the case either. A JIT might be smart and decide that casting 1000 times is semantically the same as casting once and hence do it only once. Then again, folks do amazing things to make the JRE faster!

Measuring Java performance is a bit of nightmare. It's so hard to pin anything down exactly. Oh well, it's good to be reminded to measure the results in context. Measuring with multiple JREs will be a good idea too.

Ismael Juma said...

It depends on your definition of free. As I said, I was testing inner loop performance after a warm-up. That gives time for the JIT to optimise the code. There is certainly cost involved in optimising the code in the first place. However, the idea is that for cases where performance actually matters (i.e. there are many calls) the casts turn out to be cheap since the time it takes to optimise is a very small percentage of the time spent executing the code.

Agreed about it being a tricky area. Things are always changing and there are so many variables that affect things that it's hard to pin things down indeed.

Ismael

Ed Merks said...

Ismael,

I created a different implementation that uses an array instead; it's in a patch in the bugzilla. In my preliminary tests it seems faster when measuring the cost of EObject.eContainer(). I think that's largely because casting is not nearly as expensive as it used to be (just like you suggested) and, by keeping an int flags field on the object, there are fewer calls needed to the Storage object, calls that likely can't be in-lined.

I was reading your blog. There's some very interesting material in there!

Ismael Juma said...

Excellent Ed, nice work. :)

Glad you found the blog interesting. I finally joined the party last month. Very late, I know. ;)