Tuesday, 21 October 2008

Hand-written and Generated Code: Never the Twain Shall Meet

There are many tools these days that generate code. Before writing such a tool, stop to consider whether you really should be generating code in the first place. After all, you're generating code from a model (you might not think of it as a model, but that's what I call it because that's what it is), so if you have enough information to generate the code that realizes the behavior described by the model, you obviously have enough information to emulate the behavior of the model. Byte code has a cost. It causes bloat. Don't produce it if you don't have to. EMF, for example, can emulate an instance of an Ecore model, including a fully functional editor, without generating a single line of code; just try invoking "Create Dynamic Instance..." on any EClass's pop-up menu. It's a cool thing.
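To make the "emulate rather than generate" idea concrete, here is a minimal sketch in plain Java. It is not EMF's API: ModelClass and DynamicObject are hypothetical stand-ins invented for this illustration, and they only show that a description of the features is enough to get type-checked getters and setters without generating a class per model element.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-ins for illustration; NOT EMF's EClass/EObject.
final class ModelClass {
    final String name;
    // Feature name -> Java type that values of the feature must have.
    final Map<String, Class<?>> features = new LinkedHashMap<>();

    ModelClass(String name) {
        this.name = name;
    }

    ModelClass feature(String featureName, Class<?> type) {
        features.put(featureName, type);
        return this;
    }

    // Creates an instance whose behavior is driven entirely by this model.
    DynamicObject newInstance() {
        return new DynamicObject(this);
    }
}

final class DynamicObject {
    private final ModelClass type;
    private final Map<String, Object> values = new LinkedHashMap<>();

    DynamicObject(ModelClass type) {
        this.type = type;
    }

    void set(String featureName, Object value) {
        Class<?> expected = type.features.get(featureName);
        if (expected == null) {
            throw new IllegalArgumentException(type.name + " has no feature " + featureName);
        }
        if (value != null && !expected.isInstance(value)) {
            throw new IllegalArgumentException(featureName + " expects " + expected.getSimpleName());
        }
        values.put(featureName, value);
    }

    Object get(String featureName) {
        if (!type.features.containsKey(featureName)) {
            throw new IllegalArgumentException(type.name + " has no feature " + featureName);
        }
        return values.get(featureName);
    }
}

final class DynamicInstanceDemo {
    public static void main(String[] args) {
        // The "model": a Book with two typed features, described purely as data.
        ModelClass book = new ModelClass("Book")
                .feature("title", String.class)
                .feature("pages", Integer.class);

        DynamicObject instance = book.newInstance();
        instance.set("title", "Never the Twain");
        instance.set("pages", 42);
        System.out.println(instance.get("title") + ", " + instance.get("pages") + " pages");
    }
}
```

The trade-off is the usual one: no extra byte code per model class, but also no statically typed getTitle() for clients to call.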

If you do have a good reason to generate code, keep in mind that humans will read it. Hand writing bad code is unacceptable, but generating bad code is completely inexcusable. Have you ever seen generated code where every referenced class name is fully qualified? It's clearly the simplest way to avoid name collisions, but it seems disrespectful of the human reader. Generating code that isn't of hand-written quality gives generators a bad reputation, so focus on creating a thing of beauty.

In the ideal world, generated code would be complete. It would never need to be sullied from its untouched pristine state. Technically, you would not even need to put it under source code control because of course you can always regenerate it. You'd want to be very careful to version the generator in that case though. And keep in mind that if you don't version control the generated code, all your clients will need to install the right version of the generator tools simply to produce a functional code base. Also, it will be more difficult to detect when changes in the generator produce code that's different from what you've been testing. Treating generated code as if it's ephemeral has definite appeal, but it is something to consider carefully.

You've probably noticed that the world is typically not quite ideal, and often far from it. So it's often the case that clients need to tailor what's generated. Sometimes that's even the whole point: the generated code is just scaffolding or a starting point from which to hand code a complete application. It's typically important to be able to invoke the generator again if the input model for it changes. Because many generators will simply overwrite any files they generated the last time, keeping hand-written changes separate is obviously important in that case. But many generators also support protected regions where users can write their code such that it will not be overwritten. EMF takes this design to the extreme, effectively inverting it, by marking all the regions that the generator may touch. I like to think it's a bright idea.
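The inverted approach can be sketched roughly as follows. This is only an illustration of the idea, not EMF's actual merge implementation: Member and MarkerMerge are hypothetical names, a source file is reduced to a map from member name to body, and the rule for dropping stale generated members is an assumption of this sketch. A member flagged as generated belongs to the generator and is replaced on regeneration; anything without the flag belongs to the user and is left alone.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of one member (method or field) in a source file.
final class Member {
    final String body;
    final boolean generated; // true when the member carries the generator's marker

    Member(String body, boolean generated) {
        this.body = body;
        this.generated = generated;
    }
}

final class MarkerMerge {
    // Merges freshly generated members into an existing file's members.
    static Map<String, Member> merge(Map<String, Member> existing,
                                     Map<String, Member> regenerated) {
        Map<String, Member> result = new LinkedHashMap<>();
        // Walk the existing file first so its member order is preserved.
        for (Map.Entry<String, Member> entry : existing.entrySet()) {
            Member old = entry.getValue();
            Member fresh = regenerated.get(entry.getKey());
            if (!old.generated) {
                result.put(entry.getKey(), old);   // user-owned: never touched
            } else if (fresh != null) {
                result.put(entry.getKey(), fresh); // generator-owned: replaced
            }
            // Generator-owned but no longer produced: dropped (sketch assumption).
        }
        // Members that are new in this generator run are appended.
        for (Map.Entry<String, Member> entry : regenerated.entrySet()) {
            result.putIfAbsent(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Member> existing = new LinkedHashMap<>();
        existing.put("getName", new Member("return name;", true));
        existing.put("getLabel", new Member("return \"Custom \" + name;", false));

        Map<String, Member> regenerated = new LinkedHashMap<>();
        regenerated.put("getName", new Member("return this.name;", true));
        regenerated.put("getLabel", new Member("return name;", true));

        Map<String, Member> merged = merge(existing, regenerated);
        for (Map.Entry<String, Member> entry : merged.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().body);
        }
    }
}
```

In this scheme, taking ownership of a member amounts to removing its marker, which is why accidentally leaving the marker in place is the main way to lose a hand-written change.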

There are those who believe that one should never modify generated code. I'm not one of those people, though there are clear advantages to avoiding it. For example, it's really easy to see what you've written yourself versus what was generated for you. JDT's support for filters mitigates that advantage by supporting the same thing dynamically, i.e., it hides everything marked @generated. More importantly, it's possible to delete all the generated stuff to do a clean sweep. That's probably the strongest reason. On the downside, more classes result in more bloat. Even an empty class will take close to 0.5k. Worse yet, if you can't anticipate which files a user will wish to specialize, you're liable to double the number of classes. For example, in the implementation of MOF that preceded EMF, for every EClass Foo, it would generate FooGen, Foo, FooGenImpl, and FooImpl, where Foo extends FooGen, FooGenImpl implements Foo, and FooImpl extends FooGenImpl and implements Foo. The whole design caused significant bloat and just looked very stilted; even the public API was very clearly tainted by the fact that a generator was being employed. It's important to realize that small droplets of bloat will tend to add up...
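For readers who haven't seen that four-type pattern, it looks roughly like the sketch below. Only the type names and their relationships come from the description above; the name property, the describe() method, and the choice to make FooGenImpl abstract are assumptions added so the example compiles and does something.

```java
// Generated: overwritten on every generator run.
interface FooGen {
    String getName();

    void setName(String name);
}

// Generated once as a placeholder, then owned by the user.
interface Foo extends FooGen {
    String describe(); // hypothetical hand-written addition to the API
}

// Generated: overwritten on every generator run. Declared abstract here
// (an assumption of this sketch) so that user additions to Foo still compile.
abstract class FooGenImpl implements Foo {
    private String name;

    @Override
    public String getName() {
        return name;
    }

    @Override
    public void setName(String name) {
        this.name = name;
    }
}

// Generated once as a placeholder, then owned by the user.
class FooImpl extends FooGenImpl implements Foo {
    @Override
    public String describe() {
        return "Foo named " + getName();
    }
}

final class GenerationGapDemo {
    public static void main(String[] args) {
        Foo foo = new FooImpl();
        foo.setName("example");
        System.out.println(foo.describe());
    }
}
```

That is four types for one modeled class, and clients see FooGen in the public API whether or not anyone ever customizes Foo.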

So while some will argue that, when it comes to hand-written and generated code, never the twain shall meet, I think it's important to keep in mind that, as with most things in life, there are trade-offs to our design decisions. As such, it's more important to explain and understand all the considerations that should be taken into account when making a choice than it is to decide which specific choice is a best practice in general. After all, EMF's generator model generates both the Ecore model and itself, so we're not actually in a position to delete our generated code. We need it to bootstrap the environment. It's a prickly problem.

So while it's often a good practice to separate generated code from hand-written code, and it's not necessary to version control generated code in that case, these decisions come at a price.


ekke said...

for me it's always the most important thing to decide which code is only generated, which is always enhanced by manually written code and which is partly enhanced manually.
in my projects there are always all three kinds and (after the difficult decision which to use where) it's easy to use them all with openArchitectureWare, because I can define different outlets for generated or generated-with-protected-areas.
if only some parts of the code are manually enhanced I prefer the protected regions and avoid the complex structure of IMPL classes / interfaces as you described above. If my software-design needs such a structure - OK, but if I bloat my code with these classes only for technical reasons because I'm generating code then I'm using protected areas instead.

Jan Köhnlein said...

Here are some of the experiences which made me avoid mixing generated and manually written code:

Mixing generated and hand-written code requires the target platform to support some kind of annotation or comment mechanism. This is not always the case, e.g. for Eclipse plug-in manifests. Furthermore, Eclipse gets quite confused if you change the manifest with an external generator. So I even prefer completely generated and completely manually written plug-ins.

A broken generator run can quickly mess up your whole workspace, making it very hard to recover your manual changes. Checking everything into a VCS before running the generator is not always an option.

Furthermore, many reconcilers I used - the programs that actually merge generated and existing code - work only halfway. If I have to check for each manual change if it's still there after generating code, the development process gets rather annoying. I might even lose all the agility of the generative process.

One more thing: Inheritance is not the only mechanism to integrate generated and hand-written code. Consider using dependency injection, generated hooks or callbacks, extension points etc. which will not bloat your codebase.
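As a rough sketch of the hook idea, assuming a generator that emits a small callback interface next to the class that uses it: all names below (NameValidator, GeneratedCustomer, NonBlankNameValidator) are hypothetical and not tied to any particular generator. The generated class is never edited; the hand-written behavior is passed in through the constructor.

```java
// Generated: the hook the generator exposes for hand-written logic.
interface NameValidator {
    boolean isValid(String name);
}

// Generated: regenerated freely, never edited by hand.
final class GeneratedCustomer {
    private final NameValidator validator;
    private String name;

    GeneratedCustomer(NameValidator validator) {
        this.validator = validator; // hand-written behavior is injected here
    }

    void setName(String name) {
        if (!validator.isValid(name)) {
            throw new IllegalArgumentException("Invalid name: " + name);
        }
        this.name = name;
    }

    String getName() {
        return name;
    }
}

// Hand-written: lives in its own file and survives every generator run.
final class NonBlankNameValidator implements NameValidator {
    @Override
    public boolean isValid(String name) {
        return name != null && !name.trim().isEmpty();
    }
}

final class HookDemo {
    public static void main(String[] args) {
        GeneratedCustomer customer = new GeneratedCustomer(new NonBlankNameValidator());
        customer.setName("Ada");
        System.out.println(customer.getName());
    }
}
```

Only one extra hand-written type is needed per customization point, instead of a parallel class for every generated one.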

Ed Merks said...


Yes, things like plugin.xml and MANIFEST.MF don't support merge because we don't have a nice mechanism for marking things. They also don't support separating into a generated part and a non-generated part. So they suffer from the generate-once problem. I saw a few examples at MDSD where files are generated once as placeholders for user changes. This seems similar and that's okay. But doubling the number of plugins, like doubling the number of classes, doesn't seem like an ideal approach to advocate as best for all cases.

When you're designing a generator and it could produce a mess, it's of course very frustrating to mess up a code base. Of course EMF's generator doesn't do that, though while developing it, we had to be more careful than we need to be today. Fortunately Eclipse keeps a history so recovering changes is annoying but not impossible. Keeping a backup zip is of course possible as well. But obviously separating the two is easiest from a disaster recovery point of view.

I've been using EMF's merging generator for many years, so I have great confidence that it produces only good results; the worst possible outcome is to overwrite code that's marked as owned by the generator but that I had intended to control myself. It's an exceedingly agile process that an uncounted number of people use every day...

The inheritance example was used because Ecore's purpose is to generate exactly such an API. When designing a DSL, the purpose is to generate infrastructure to realize that DSL's semantics, so all manner of good techniques are available. If it's possible to produce a good design that facilitates separation of generated and hand-written code, I totally agree, that's all the better. But to argue that all generators should conform to this, including and in particular EMF's generator for Ecore, seems to me to be overzealous.