Data Constellation - Blog
Clifford Heath
Why software tools fail and what's needed to succeed
2009-01-13
<p>Some of my perceptions of the social dynamic of the use of software
tools.</p>
<p>Designers of software are motivated mainly by kudos – if their
success might be put down to a tool, they aren’t as motivated to
use it. The thing that motivates them is to produce the nice software
(that’s already in their head) in the shortest possible time. They
like tools when they lessen the work without reducing the quality.
Tool output is often seen as a compromise and suboptimal, so there
really has to be a big time saving to impress these people. Existing
ORM and CASE design tools often haven’t produced the right artifacts
to shorten schedules – they’ve been targeted at doing things better
and getting them right more often. The developer’s hubris doesn’t
allow them to see this as an advantage, since they think if left
alone, they could produce perfect software without such tools, and
that’s what they want to be admired for.</p>
<p>People who want to be “top dog” in a development team, wielding
control in excess of their work output, like to use tools – because
their knowledge of the tool gives them a special place, they pull
the strings. But such people produce so little of the final artifacts
of a project they often contribute little to the success of projects
anyhow. Instead they create turmoil by insisting that things be
done their way, and redone even if a solution is already working.</p>
<p>These last two paragraphs explain why the CASE tool movement of the
1980s failed so miserably. It wasn’t because the tools didn’t work.</p>
<p>Organisations will not invest in software development using tools
from companies that may fail, or where the tool is seen as risky
or dead-end. A technology has to be established in order to provide
an escape route… but these days, a tool being open source is itself
a viable escape route. Nothing is invested to get started, and no
vendor can take you down with them.</p>
<p>The IT function as a whole is often viewed by the business as
having far too much control. In part this is a natural reality, as
the business can’t move forward without the IT changes, and they
don’t have enough understanding of the challenges of succeeding in
software development to trust IT. But in part it is also because IT uses
its power to gain some control over the business direction, sometimes
with legitimacy but not always. So the IT function is further
distrusted. In addition, IT often fails to deliver adequate
functionality in a timely way, and so is seen as less competent
than other areas of the business.</p>
<p>On the other side, the business isn’t often much good at writing
specifications. The language used is too vague, and doesn’t reflect
an understanding of how the IT systems will support business changes,
because the business is concerned with what and why – as it should
be. So they employ business analysts, who are meant to bridge the
gap, but often fall too much on one side or the other. When IT tries
to explain that a feature cannot be implemented, or is incompletely
specified, they have great trouble explaining why there’s a problem.
In part, that’s because they think in terms of how, since that’s
the natural tendency of the engineering mind. The failure of the
business to understand the problem is seen as legitimizing the
degree of control that IT asserts.</p>
<p>All this is down to communication and language. A semantic modelling
language must make it easier for the business and IT to work together,
not as opponents, engaging in paper warfare, but really collaborating.
The best way to do that is to create a single language that both
groups can read and write, that can provide both with what they
need – precision and consistency for the IT folk, and verifiability
against the business rules and process for the business folk.</p>
<p>In the process, the language can also be used to generate the
artifacts that both groups need (schemas, code, documentation of
business rules) – but its main attraction to the business is the
way it changes the communication process. The generated artifacts
are about reducing the project schedules while ensuring continuous
compliance with the specification – but they must be of high quality,
and preferably, the generators must be tweakable (open source).</p>
<p>That’s the language I hope CQL will become.</p>
There are no attributes
2008-04-23
<p>Things don’t have attributes, they have relationships to other
things.</p>
<p>Programmers get taught to sort things into objects and their
attributes, but that isn’t always helpful. We tend to treat anything
we can write down as a value (like a name, a number, or a date) as
an attribute of something. But sometimes, a value identifies a
thing, and that thing might have other attributes. So the distinction
breaks down, and we have to rearrange. In a database, that can mean
a lot of extra work.</p>
<p>In a semantic model things don’t have attributes, they have
relationships. A relationship might be to a value, and perhaps a
given thing may allow only one value in that relationship. That
makes it seem like an attribute, but we need to keep those ideas
separate.</p>
<p>Is your birth date an attribute? No, it’s just a date to which you
have a special relationship – the birth date relationship. Other
people have other relationships to that date, and so do other things.
Somebody registered their car that same day. The same date plays
roles in other relationships, and those roles might even carry
meaning in relation to your birth date.</p>
<p>Because you have only one birth date, it seems obvious to store it
as an attribute. That means it’s yours, and not intrinsically related
to other things. But what if you made the wrong decision about which
concept the birth date belongs to?</p>
<p>Consider your birth place. Places have their own identity, just as
dates do. You only have one birth place, so that could be an attribute
too. But there were other people present at your birth… your mother
for instance. Your doctor, and nurses. If these things matter in a
database you’re designing, you might need to model the birth event.
Birth place and birth date are now seen to be attributes of your
birth event, not attributes of you at all. The other people involved
in your birth also have a relationship to that birth event.</p>
<p>Consider your given name… or is that names? Do the separate names
matter, or just the string of names joined up with spaces separating
them? That depends on what purpose you have in using the names. You
might want to be able to quickly find all the people who have “John”
as a middle name – then it might make sense to store the names
separately, not as an attribute of the person object.</p>
<p>The same reasoning follows for every kind of attribute. If you start
out by thinking about objects (entities) and their attributes, you
make assumptions about the way your data will be used.</p>
<p>Instead, start by thinking about how entities are related to values
(and to other entities), and make sure you have that clear in your
mind. Describe each relationship, using expressions such as “Person
was born in Birth” and “Birth is of Person”, “Birth occurred on
birth-Date”, “Birth occurred at birth-Place”, “Person (as Mother)
gave Birth”. A hyphen after each adjective will help keep the idea
of Place and Date separate from birth date and birth place, while
making it clear that birth date is a special role of a Date.</p>
<p>Continue this process of semantic modeling until you have described
most of the entities and values that matter, and the relationships
(fact types) that join them up.</p>
<p>As you go, you can also record the cardinality of each relationship:
“Birth was at exactly one birth-Place”. The “exactly one” isn’t
part of the relationship, it’s just a constraint over it. It limits
the cardinality of the relationship. Other constraint expressions
you might use are “at least one”, or “at most one”. Don’t forget
that when you say “exactly one”, or “at least one”, you will need
to always know the answer. If you ever need to store information
about a Birth but you might not know the birth date, say “at most
one”.</p>
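<p>To show what it means to keep the relationship and its constraint apart, here is a minimal sketch in plain Ruby. It is not CQL, and the names <code>Role</code>, <code>FactType</code> and the quantifier symbols are invented purely for this illustration.</p>

```ruby
# Illustrative only: Role, FactType and the quantifier symbols are
# hypothetical names, not part of CQL or of any library.
QUANTIFIER_WORDS = {
  exactly_one:  "exactly one",
  at_most_one:  "at most one",
  at_least_one: "at least one"
}.freeze

Role = Struct.new(:concept, :adjective, :quantifier) do
  # "birth-Date": the adjective, a hyphen, then the underlying concept.
  def name
    adjective ? "#{adjective}-#{concept}" : concept
  end

  # "exactly one" and "at least one" mean the answer must always be known.
  def mandatory?
    %i[exactly_one at_least_one].include?(quantifier)
  end

  # "exactly one" and "at most one" rule out a second value.
  def single_valued?
    %i[exactly_one at_most_one].include?(quantifier)
  end
end

FactType = Struct.new(:subject, :verb, :role) do
  # The relationship itself, with no constraint attached.
  def reading
    "#{subject} #{verb} #{role.name}"
  end

  # The same relationship with its cardinality constraint spelled out.
  def constrained_reading
    words = QUANTIFIER_WORDS[role.quantifier]
    words ? "#{subject} #{verb} #{words} #{role.name}" : reading
  end
end

# The birth date may be unknown, so it gets "at most one".
birth_date  = FactType.new("Birth", "occurred on", Role.new("Date", "birth", :at_most_one))
birth_place = FactType.new("Birth", "occurred at", Role.new("Place", "birth", :exactly_one))

puts birth_date.reading               # Birth occurred on birth-Date
puts birth_date.constrained_reading   # Birth occurred on at most one birth-Date
puts birth_place.role.mandatory?      # true
```

<p>In this sketch the reading stays the same whichever quantifier is attached; only the constraint changes, which is the separation the paragraph above asks for.</p>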
<p>When you’re done, or nearly done, you’ll know whether you need a
separate Birth table in addition to the Person table, or a separate
GivenNames table. Of all the things that you’ve decided matter to
you, if Birth is only relevant to one of them (the Person) then,
since there’s only one Birth per Person, you may be able to
absorb the birth date and place as columns of the Person table.
It’s a little more complicated than that, but not much.</p>
<p>This is one reason why semantic modeling works better than traditional
ER or UML modeling. It’s possible to make a complete model before
making decisions about which things are attributes and which aren’t.
It still won’t be your final model, but you’ve postponed some bad
decisions so you can make good ones instead.</p>
<p>Because of these rules about when you can absorb things, even small
changes or additions to your semantic model will cause changes in
which tables you need. The shape of your relational database will
always be more prone to change than the original semantic structure.
For example, a new requirement might be to link up the Birth to the
hospital records management system, to be used in paying the medical
staff who assisted. Suddenly the Birth details don’t look very
Personal any more – even though they haven’t changed!</p>
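<p>A deliberately crude sketch of that absorption rule, in plain Ruby, might look like the following. The names <code>Fact</code> and <code>absorb_into</code> are hypothetical, Date and Place are treated as plain values, and a real mapping is (as noted above) a little more complicated than this.</p>

```ruby
# Illustrative only: Fact and absorb_into are hypothetical names, and the
# rule below is far cruder than a real relational mapping.
# Date and Place are treated as plain values for this sketch.
Fact = Struct.new(:subject, :object, :object_kind, :one_to_one)

# Returns the entity whose table could absorb `concept`, or nil if
# `concept` needs a table of its own.
def absorb_into(concept, facts)
  # Fact types that tie this concept to another entity, in either direction.
  links = facts.select do |f|
    f.object_kind == :entity && [f.subject, f.object].include?(concept)
  end
  # Absorb only when exactly one entity cares, and it is one-to-one.
  return nil unless links.size == 1 && links.first.one_to_one

  link = links.first
  link.subject == concept ? link.object : link.subject
end

model = [
  Fact.new("Person", "Birth", :entity, true),   # Person was born in Birth
  Fact.new("Birth",  "Date",  :value,  false),  # Birth occurred on birth-Date
  Fact.new("Birth",  "Place", :value,  false)   # Birth occurred at birth-Place
]
puts absorb_into("Birth", model).inspect        # "Person"

# The new requirement: medical staff who assisted at the Birth get paid.
extended = model + [Fact.new("Staff", "Birth", :entity, false)]
puts absorb_into("Birth", extended).inspect     # nil
```

<p>Nothing about the Birth facts themselves changed between the two calls; one extra fact type was enough to change the answer about which tables are needed.</p>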
<p>But because your semantic model hasn’t changed much, you should be
able to get away with making only small changes in your code –
assuming you followed through properly, that is: the details of
the changed tables must be hidden under a semantic layer. But that’s a
problem for another time…</p>
<p>In the meantime, restrain yourself from making early assignments
concerning attributes, and you’ll find you discover new meanings
in the information that embodies the rules of your business.</p>
Presentation on Treetop
2008-02-01
<p>Last night I presented Nathan Sobo’s excellent
<a href="http://github.com/nathansobo/treetop/">Treetop</a>
<a href="http://pdos.csail.mit.edu/~baford/packrat/popl04/peg-popl04.pdf">packrat parser</a>
generator to the Melbourne Ruby community. Some of the
material was directly adapted from
<a href="http://rubyconf2007.confreaks.com/d1t1p5_treetop.html">Nathan’s presentation</a>
(thanks Nathan!), but I took a different approach and showed my own examples.</p>
<p>I’ve been using Treetop to construct a parser for
<a href="http://dataconstellation.com/ActiveFacts">CQL</a>, a very interesting
new development based on restricted natural language. CQL is for
data definition and query, and rolls together my natural language
approach with concepts from <a href="http://ormfoundation.org">ORM</a>,
<a href="http://en.wikipedia.org/wiki/Semantics_of_Business_Vocabulary_and_Rules">SBVR</a>,
Prolog, and the
<a href="http://en.wikipedia.org/wiki/Web_Ontology_Language">Web Ontology Language OWL</a>.
But I don’t present that language here.</p>
<p>If you’re interested in Treetop, you can download the 23MB <a href="/assets/2008/02/01/1-Treetop.mp3">MP3
audio file</a> and the <a href="/assets/2008/02/01/2-Treetop.pdf">PDF of my
presentation slides</a>.
The <a href="/assets/2008/02/01/3-arith.treetop">example code</a> and
<a href="/assets/2008/02/01/4-expr.rb">driver program</a> are also available.</p>
Out of Vietnam, Part 2
2007-11-15
<p>I talked about how we build systems by composing them from elements,
not by decomposing monolithic “problem statements”. The elements
always depict either states, or transitions between states. These
two angles are the information perspective and the process perspective.
Process steps always transit between legal states, so the set of
legal states must be defined first. It’s not that information is
pre-eminent, but it does tend to lead rather than lag the rest of
the design. So here I’m focusing on the information part of our
design.</p>
<p>Now, we need to build up an overall aggregate picture of what things
our system can describe, and what it needs to know about those
things. The picture is made of many small elements, and many small
constraints on the ways they interact. The sum of these small things
forms our conceptual model – they reflect the way we and our clients
think about their problem, not its solution. To store them however,
we need to group them together for efficient management. That’s
what we’re doing when we’re building a database design – writing
down all the things that the system needs to know, in a way that
will be efficient to manage. There are two goals here: manage all
the elements and their interactions without losing track of any,
and produce an aggregate structure that is efficient. These goals
work against one another.</p>
<p>When we’re done, if we’ve done a good job, we have a normalised
database design. “Normalised” basically just means that it provides
only one way to represent any of the elementary facts, so that you
can’t have two versions that disagree. But there’s another property
of normalised data that causes problems: any one “thing” will only
have one record, and all facts about that thing for which there is
only one value at a given time are stored in that record. This
aggregation is a fine principle for creating efficient physical
storage structures, but the aggregation leaks into our code.</p>
<p>When we query the data using SQL, there’s one way of accessing a
fact that has only one value for each thing – the column – and
there’s another completely different and somewhat difficult way of
accessing facts that have more than one value for each thing – the
join. SQL forces the direct use of the physical database model,
while at the same time hiding the true domain model which is present
in the elementary form. This prevents the domain expert from properly
engaging in verifying the model and ensures communication problems
because of the translation and interpretation required.</p>
<p>Being bound to the physical model also tragically limits the agility
(evolution) of applications because the physical model is always
more unstable than the conceptual model. The mere number of values
(or other things) related to a thing in a given fact relationship
should be a minor detail, yet it completely controls the physical
model. When a requirement changes in tiny ways, we can sometimes
end up needing to do a major restructure of the database, potentially
across many tables.</p>
<p>Imagine you have a table of users, and one column is the “given
names” column. Your client now needs to store information about the
reason each user’s parents chose each name… and all of a sudden you
need to move the contents of that column out into a new “given
names” table. Every query that fetched a user will probably have
also fetched their given names, and so now needs to be rewritten.
All we did was add a new fact to an otherwise complete model – why
does all our code need to be checked and maybe rewritten? Ok, perhaps
“given names” is an uncommon example, but this sort of thing occurs
so often in relational databases that it has had its own name for
more than twenty years: attribute migration. It’s an example of just
one way the addition of a small item to our elementary model causes
big impacts on our aggregate design.</p>
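<p>Here is a minimal sketch of the “given names” migration in plain Ruby, with in-memory arrays standing in for tables. The class and method names (<code>ColumnStore</code>, <code>JoinStore</code>, <code>middle_name?</code>) and the sample data are invented for illustration; the point is only that code which asks its question through one uniform accessor need not care whether the answer comes from a column or from a second table.</p>

```ruby
# Illustrative only: in-memory arrays stand in for database tables, and
# ColumnStore, JoinStore and middle_name? are hypothetical names.

# Before the change: given names are a single column of the users table.
class ColumnStore
  def initialize(users)
    @users = users
  end

  # One value per user, so the answer is read straight from the row.
  def given_names(user_id)
    @users.find { |u| u[:id] == user_id }[:given_names].split
  end
end

# After the change: each name has its own row, so it can carry a reason.
class JoinStore
  def initialize(given_name_rows)
    @given_name_rows = given_name_rows
  end

  # Many values per user, so the answer needs the equivalent of a join.
  def given_names(user_id)
    @given_name_rows.select { |g| g[:user_id] == user_id }
                    .sort_by { |g| g[:position] }
                    .map { |g| g[:name] }
  end
end

# Application code asks the same question either way.
def middle_name?(store, user_id, name)
  store.given_names(user_id).drop(1).include?(name)
end

before = ColumnStore.new([{ id: 1, family_name: "Smith", given_names: "Alan John" }])
after  = JoinStore.new(
  [{ user_id: 1, position: 2, name: "John", reason: "family tradition" },
   { user_id: 1, position: 1, name: "Alan", reason: "after a grandfather" }]
)

puts middle_name?(before, 1, "John")   # true
puts middle_name?(after, 1, "John")    # true
```

<p>Code that reached into the row for the <code>given_names</code> column directly would have needed rewriting after the move; <code>middle_name?</code> did not.</p>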
<p>So while relational databases are one of the preeminent achievements
of computer science, they must move beyond requiring direct dependence
on the physical storage structures. SQL is the problem here, because
of the gross difference between accessing a column (single value)
and another table via a join.</p>
<p>Replacing SQL by a language that has this property of uniformity
of reference must be the top priority if the industry is to move
forward in solving this critical problem. There is a way out of
Vietnam… but only after we replace SQL. Tune in next time for a
first peek at the language that can do this, the Constellation Query
Language.</p>
Out of Vietnam, Part 1
2007-10-21
<p>Object/Relational mapping has been called the Vietnam of Computer
Science, meaning, I think, that it’s become an intractable problem
that we never needed to get into in the first place. Actually, it
was unavoidable, but there’s a way of hiding the problem, which is
the subject of this series of articles.</p>
<p>The core of software design is expression; how do we express what
we want a system to do, to be, and to achieve? It’s hard for software
folk to think clearly about this. We’re conditioned by having
problems handed out on a sheet of paper during our training. We’re
taught to break them down, decompose them, by various methods. We
worry and argue about the right way to go about decomposing problems.</p>
<p>In reality, we never receive problems fully-formed like this, and
so we seldom have to decompose them. Instead, our clients witter
on about how this should have one of those, and how a thing is on
this list unless that condition holds… and we have to compose a
system out of these fragmentary utterances. Composition and
aggregation, not decomposition, is our main activity. In the process,
we try to distil and create conceptual purity from the original
communications.</p>
<p>In choosing how to aggregate things, we take various approaches.
Object-Oriented practitioners group things mostly by shared behaviour.
Database people struggle to avoid duplication while clustering
things to maximise disk throughput and transactional reliability.
In both cases, the attempt to maintain purity is moderated by the
need to work within the bounds of physical computer hardware – main
memory on the one hand, disk drives on the other. One is volatile,
the other persistent. These two place very different constraints
on the shape of an optimum solution. Both are based in the real
world, so the problem is to some extent unavoidable.</p>
<p>It gets worse though… neither solution is very close to the original
problem statement, which shuts out the domain expert. We actually
have not a two-way problem, but a three-way one, played by three
roles:</p>
<ol>
<li>The Business Analyst or domain expert</li>
<li>The Software Designer/architect</li>
<li>The Data Managers</li>
</ol>
<p>In general, none of these wants quite the same things or talks the
same language as the others, and none really accepts the others’
view on things. Depending on who you ask, they’ll always point to
another group as being the origin of the communication problem. So
we have a stand-off, and rocks get thrown in all directions. This
is the most pernicious and costly communication problem in the
software industry.</p>
<p>To get out of Vietnam, we have to create a language in which all
three groups can be equally fluent, and which gives each group what
they need. We need a language of facts which is at once formal and
accessible, and which can be automatically and efficiently mapped
to objects and to normalised database designs. It must reflect
natural verbalisations, yet have an unambiguous meaning. It’s not
UML or Barker ER notation. Think it sounds too hard? Come back to
read the next instalment.</p>
How to ruin a Rails project
2007-10-18
<p>There are lots of ways to ruin any project. I’ve seen most of them
over the last few decades, but this year I’ve been called in to
salvage a series of Rails projects that were, well, off the rails,
in some ways that may be special to Rails. So I’ll try to steer clear
of the ordinary foul-ups, and focus on the ones that Rails seems
to attract.</p>
<ol>
<li><p>We have four months before the website is needed, and Rails is
so productive that we don’t need to get started yet. We can deliver
the specifications in a couple of months or so, and everyone will
be ready to knock out the website in two weeks. Right. Let me know
how that goes, ok?</p></li>
<li><p>Databases suck, no-one wants to write SQL, and I can’t do all
my validations in it anyhow, so why should I do any? We’ll do things
the Rails Way and put all that stuff in the code where it’s easy.
After all, who needs a uniqueness constraint if the code always
checks for an existing record before inserting a new one, right?
Nothing can go wrong with that can it?</p></li>
<li><p>Indexes? Add them after users complain that the site is too slow –
even if it was obvious after a moment’s thought that they were
always going to be needed. MySQL is so bad at optimizing queries
that it might as well be forced to do the full table scans it was
probably going to do anyway. And besides, it worked just fine with
the 5 test records I put in the test fixtures manually.</p></li>
<li><p>Performance doesn’t matter, so if the site is too slow, well,
at least it was quick to develop. And when the client urgently needs
a report that should take five seconds to produce, but because it’s
a five-way join and you didn’t add any indexes it times out in
Apache’s mod_proxy after the regulation five minutes, well, that’s
why you turn your mobile phone off at night and ensure you can never
be found online, right? That way you can get a good night’s sleep
while the client is tearing out his hair and losing his business.</p></li>
<li><p>Foreign keys. You don’t need the database to enforce them if you
get the code right. No need to actually take a look at the database
from time to time to see whether the invariants your code is supposed
to enforce are actually held. So when you later make administrative
changes and delete records that other ones refer to, well, ActiveRecord
is good about providing a nil that should do nothing, and if not,
well, there’s always an exception catcher to tell you your mistake.</p></li>
<li><p>Oh, yes, exceptions. The Rails log is full of them, but they’re
mostly from Chinese hackers trying to find hidden features, or
irrelevant little deadlocks or races that made some user redo their
work. No big deal, it only happens occasionally. No need to deploy
one of the nice plugins that send you email when you get an exception,
of course. That would just mean you’d have to go and find out why
it happened, and Rails exists to reduce boring work.</p></li>
<li><p>If it works for one user, it’ll work for hundreds, won’t it?
Transactions and locks are for banks, not for websites. And two-phase
commit, that’s engagement and marriage isn’t it, not something you’d
use in a payment protocol? Oh, and I sprinkled a few magic
Model.transaction {} blocks around the place, and they must work,
because people who should understand such things said they work.</p></li>
<li><p>Release management is for wimps. Just use the SVN trunk, and
when you check in code, check it out on the test server, let the
client look it over, then deploy it to production. No need even to
log in to do that, just cap deploy – you can do it without getting
out of your pyjamas. All your developers are demigods who never
make mistakes anyhow, so if one on one side of the city deploys the
other one’s code into production without even Skyping or picking
up the phone, there won’t be any unforeseen interactions, will there
now?</p></li>
<li><p>It was so easy to write, any fool can see it’s correct. TDD is
fine for some slow thinkers, and we’re glad Rails makes it easy for
them, but seriously, do you expect me to write 100 lines of code
to test 50, when I can see perfectly well that there aren’t any
errors in it? And besides, if there is an error, it’ll be a one-line
fix. Barely even need to finish my latte first, it’ll be fixed in
a moment. Not necessarily the moment before it makes the site melt
down, but that’s what backups are for, right?</p></li>
<li><p>Hmm, backups. That would have been a good idea. That would have
helped when, after discovering we hadn’t planned far enough ahead
to see the one feature that was going to make all the difference
on the big day, we let folk type data directly into the database
using an unvalidated, unlogged administration feature. Pity they
deleted the entire contents of a critical table… And even then, we
might have been able to cobble together a script to reconstruct the
transactions that were lost, except that the Rails log only lists
the form parameters, not the saved session variables that form the
context in which those parameters were relevant.</p></li>
</ol>
<p>Discipline? Who needs discipline or forethought when you’re agile?</p>
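<p>For anyone who still thinks item 2 is safe, here is a minimal sketch in plain Ruby of what goes wrong. There is no Rails and no database here; the classes are invented stand-ins that simulate a table in memory, and the interleaving of the two requests is written out by hand rather than produced by real concurrency.</p>

```ruby
# Illustrative only: these classes simulate a table in memory; they are
# hypothetical names, not ActiveRecord or any database driver.
class UnconstrainedTable
  attr_reader :rows

  def initialize
    @rows = []
  end

  # The application-level check: "is there an existing record?"
  def exists?(email)
    @rows.include?(email)
  end

  # No constraint: the table stores whatever it is handed.
  def insert(email)
    @rows << email
  end
end

class DuplicateKey < StandardError; end

# The same table, but it refuses duplicates itself, as a unique index would.
class ConstrainedTable < UnconstrainedTable
  def insert(email)
    raise DuplicateKey, email if @rows.include?(email)
    super
  end
end

# Two sign-up requests interleaved the way concurrent requests can be:
# both run their check before either one inserts.
def interleaved_signups(table, email)
  seen_by_a = table.exists?(email)   # request A finds nothing
  seen_by_b = table.exists?(email)   # request B finds nothing either
  [seen_by_a, seen_by_b].each do |already_there|
    next if already_there
    begin
      table.insert(email)
    rescue DuplicateKey
      # Only the constrained table ever gets here: the second insert fails.
    end
  end
  table.rows.count(email)
end

puts interleaved_signups(UnconstrainedTable.new, "a@example.com")   # 2
puts interleaved_signups(ConstrainedTable.new, "a@example.com")     # 1
```

<p>Both requests passed the check in code, and only the table that enforced uniqueness itself ended up with a single row.</p>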
Welcome!
2010-06-16
<p>Welcome back, my friend, to the blog that never ends… though it does seem to go into abeyance for a year or two at a time, unfortunately!</p>
<p>My name is Clifford Heath, and I’m the founder and principal consultant at Data Constellation.</p>