Presentation on Treetop

posted by cjh, 01 February 2008

Last night I presented Nathan Sobo’s excellent Treetop packrat parser generator to the Melbourne Ruby community. Some of the material was directly adapted from Nathan’s presentation (thanks Nathan!), but I took a different approach and showed my own examples.

I’ve been using Treetop to construct a parser for CQL, a very interesting new development based on restricted natural language. CQL is for data definition and query, and rolls together my natural language approach with concepts from ORM (Object-Role Modeling), SBVR, Prolog, and the Web Ontology Language OWL. But I don’t present that language here.

If you’re interested in Treetop, you can download the 23MB MP3 audio file and the PDF of my presentation slides. The example code and driver program are also available.

Out of Vietnam, Part 2

posted by cjh, 15 November 2007

I talked about how we build systems by composing them from elements, not by decomposing monolithic “problem statements”. The elements always depict either states, or transitions between states. These two angles are the information perspective and the process perspective. Process steps always transit between legal states, so the set of legal states must be defined first. It’s not that information is pre-eminent, but it does tend to lead rather than lag the rest of the design. So here I’m focusing on the information part of our design.

Now, we need to build up an overall aggregate picture of what things our system can describe, and what it needs to know about those things. The picture is made of many small elements, and many small constraints on the ways they interact. The sum of these small things forms our conceptual model - they reflect the way we and our clients think about their problem, not its solution. To store them, however, we need to group them together for efficient management. That’s what we’re doing when we’re building a database design - writing down all the things that the system needs to know, in a way that will be efficient to manage. There are two goals here: manage all the elements and their interactions without losing track of any, and produce an aggregate structure that is efficient. These goals work against one another.

When we’re done, if we’ve done a good job, we have a normalised database design. “Normalised” basically just means that it provides only one way to represent any of the elementary facts, so that you can’t have two versions that disagree. But there’s another property of normalised data that causes problems: any one “thing” will only have one record, and all facts about that thing for which there is only one value at a given time are stored in that record. This aggregation is a fine principle for creating efficient physical storage structures, but the aggregation leaks into our code.
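To make this concrete, here is a small sketch of a normalised design. The tables are hypothetical, invented purely for illustration:

    -- One record per user; every single-valued fact about a user is a column.
    CREATE TABLE users (
        user_id      INTEGER PRIMARY KEY,
        family_name  VARCHAR(60) NOT NULL,
        given_names  VARCHAR(120),
        birth_date   DATE
    );

    -- A fact that can have several values per user can't live in that record;
    -- normalisation pushes it out into its own table, keyed back to the user.
    CREATE TABLE user_email_addresses (
        user_id        INTEGER NOT NULL REFERENCES users (user_id),
        email_address  VARCHAR(120) NOT NULL,
        PRIMARY KEY (user_id, email_address)
    );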

When we query the data using SQL, there’s one way of accessing a fact that has only one value for each thing - the column - and there’s another completely different and somewhat difficult way of accessing facts that have more than one value for each thing - the join. SQL forces the direct use of the physical database model, while at the same time hiding the true domain model that is present in the elementary form. This shuts the domain expert out of verifying the model, and guarantees communication problems because of the translation and interpretation required.
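Using the hypothetical tables sketched above, the contrast looks like this - the same kind of question about the same user, asked two structurally different ways purely because of how many values the fact can have:

    -- A single-valued fact: just name the column.
    SELECT family_name
    FROM users
    WHERE user_id = 42;

    -- A many-valued fact about the same user: a completely different
    -- construct, the join against the table holding the other values.
    SELECT e.email_address
    FROM users u
    JOIN user_email_addresses e ON e.user_id = u.user_id
    WHERE u.user_id = 42;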

Being bound to the physical model also tragically limits the agility (evolution) of applications because the physical model is always more unstable than the conceptual model. The mere number of values (or other things) related to a thing in a given fact relationship should be a minor detail, yet it completely controls the physical model. When a requirement changes in tiny ways, we can sometimes end up needing to do a major restructure of the database, potentially across many tables.

Imagine you have a table of users, and one column is the “given names” column. Your client now needs to store information about the reason your parents gave you each name… and all of a sudden you need to move the contents of that column out into a new “given names” table. Every query that fetched a user will probably have also fetched their given names, and so now needs to be rewritten. All we did was add a new fact to an otherwise complete model - why does all our code need to be checked and maybe rewritten? OK, perhaps “given names” is an uncommon example, but this sort of thing occurs so often in relational databases that it has had its own name for more than twenty years: attribute migration. It’s an example of just one way the addition of a small item to our elementary model causes big impacts on our aggregate design.
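Continuing the hypothetical schema from earlier, the change might look something like this - one new fact, and the column migrates into a table of its own:

    -- The new fact ("why was this name given?") forces given names out of
    -- the users table into their own table: attribute migration.
    CREATE TABLE user_given_names (
        user_id       INTEGER NOT NULL REFERENCES users (user_id),
        given_name    VARCHAR(60) NOT NULL,
        reason_given  VARCHAR(200),
        PRIMARY KEY (user_id, given_name)
    );

    -- Copy the old column's contents across, then drop it.
    -- (A real migration would also split the old value into individual names.)
    INSERT INTO user_given_names (user_id, given_name)
    SELECT user_id, given_names FROM users WHERE given_names IS NOT NULL;

    ALTER TABLE users DROP COLUMN given_names;

    -- Every query that used to read users.given_names must now be rewritten
    -- as a join, just like the one in the earlier sketch.
    SELECT u.family_name, g.given_name, g.reason_given
    FROM users u
    LEFT JOIN user_given_names g ON g.user_id = u.user_id
    WHERE u.user_id = 42;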

So while relational databases are one of the pre-eminent achievements of computer science, they must move beyond requiring direct dependence on the physical storage structures. SQL is the problem here, because of the gross difference between accessing a column (single value) and another table via a join.

Replacing SQL with a language that has this property of uniformity of reference must be the top priority if the industry is to move forward in solving this critical problem. There is a way out of Vietnam… but only after we replace SQL. Tune in next time for a first peek at the language that can do this, the Constellation Query Language.

Out of Vietnam, Part 1

posted by cjh, 21 October 2007

Object/Relational mapping has been called the Vietnam of Computer Science, meaning, I think, that it’s become an intractable problem that we never needed to get into in the first place. Actually, it was unavoidable, but there’s a way of hiding the problem, which is the subject of this series of articles.

The core of software design is expression; how do we express what we want a system to do, to be, and to achieve? It’s hard for software folk to think clearly about this. We’re conditioned by having problems handed out on a sheet of paper during our training. We’re taught to break them down, decompose them, by various methods. We worry and argue about the right way to go about decomposing problems.

In reality, we never receive problems fully-formed like this, and so we seldom have to decompose them. Instead, our clients witter on about how this should have one of those, and how a thing is on this list unless that condition holds… and we have to compose a system out of these fragmentary utterances. Composition and aggregation, not decomposition, are our main activities. In the process, we try to distil and create conceptual purity from the original communications.

In choosing how to aggregate things, we take various approaches. Object-Oriented practitioners group things mostly by shared behaviour. Database people struggle to avoid duplication while clustering things to maximise disk throughput and transactional reliability. In both cases, the attempt to maintain purity is moderated by the need to work within the bounds of physical computer hardware - main memory on the one hand, disk drives on the other. One is volatile, the other persistent. These two place very different constraints on the shape of an optimum solution. Both are based in the real world, so the problem is to some extent unavoidable.

It gets worse though… neither solution is very close to the original problem statement, which shuts out the domain expert. We actually have not a two-way problem, but a three-way one, played by three roles:

  1. The Business Analyst or domain expert
  2. The Software Designer/architect
  3. The Data Managers

In general, none of these wants quite the same things or talks the same language as the others, and none really accepts the others’ views. Depending on who you ask, they’ll always point to another group as being the origin of the communication problem. So we have a stand-off, and rocks get thrown in all directions. This is the most pernicious and costly communication problem in the software industry.

To get out of Vietnam, we have to create a language in which all three groups can be equally fluent, and which gives each group what they need. We need a language of facts which is at once formal and accessible, and which can be automatically and efficiently mapped to objects and to normalised database designs. It must reflect natural verbalisations, yet have an unambiguous meaning. It’s not UML or Barker ER notation. Think it sounds too hard? Come back to read the next instalment.