Developer Area/Database design strategies
From Mahara Wiki
< Developer Area
Work in progress
I've created this page to try to consolidate Mahara's stock of database design strategies, when we have data structures that are tricky to map into a standard relational database.
See https://bugs.launchpad.net/mahara/+bug/1522309
Known trouble spots
This section describes existing database structures in Mahara that are annoying to deal with.
Plugin tables (_installed, _cron, _event_subscriptions)
Situation: Every Mahara plugin type has a set of standard tables called "<plugintype>_installed", "<plugintype>_config", "<plugintype>_cron", and "<plugintype>_event_subscriptions". Additionally, there are "core" tables that parallel some of these: "config", "cron", and "event_subscriptions".
The problem: It's annoying to try to read from all these tables. Whenev we want to look for event handlers, we have to retrieve the list of every plugin type, and then query all the plugin tables, and the core table.
Cause: We're trying to map the Plugin class hierarchy into a relational database. Relational databases don't natively handle inheritance.
Owners (usr, group, institution, system)
Situation: Several types of things in Mahara can be "owned" by some agent (views, collections, institutions). At first only users could own things, so the oldest tables just have an "owner" column, which is a foreign key to "usr.id". Then we added support for groups to own things, so we added a "group" column, which is a foreign key to "group.id", and made that and the "owner" column NULLable. Then we expanded to institutions owning things, so we added a third column, "institution", that maps to either "institution.id" or "institution.name" (if it's an older table).
The problem: Having three NULLable foreign keys makes it tricky to set up uniqueness constraints in a database cross-compatible way. (For instance, we need a uniqueness constraint on the view table for (owner, group, institution, urlid), but MySQL can't handle that three of those can be NULL). Also, queries (such as the one used by view_search) become complex, requiring outer joins to multiple tables, and sometimes CASE statements to equate things like "group.name", "institution.displayname", etc.
It's also conceivable that we might want to expand to ownership by other sorts of entities in the future, which is difficult to do with this structure.
The cause: I believe this is an example of polymorphism. In object-oriented terms, "view", "collection", and "institution" are all implementing a shared interface (call it "owner"). Again, not something that Relational databases can natively handle.
Viewers (view_access)
Situation: Similar to the "owners" situation, there are several classes of agents that can be the recipient of a permission stored in the "view_access" table: individual users, individual groups, "friends" of the page owner, all logged-in users, everyone using a specific "secret url", and everyone in the world. We've implemented this by having a LOT of different columns in the "view_access" table: accesstype, group, role, usr, institution, and token.
The problem: It makes view permissions queries very complex, and it's difficult to make changes to the permission scheme.
Suggested strategies
Here's an essay the describes four well-known strategies for dealing with inheritance in the database, and has a table describing their advantages and disadvantages and suggesting when to use them: http://www.agiledata.org/essays/mappingObjects.html#ComparingTheStrategies
This source calls them "One table per hierarchy", "One table per concrete class", "one table per class", and "generic schema".
You should read the essay to fully understand what they mean, but I'll give an example, using a simplified version of our own Plugin hierarchy.
In PHP, we have a base abstract "Plugin" class, and each plugin will have these attributes stored in the database: name, version, release, active. We then have several subclasses: PluginArtefact, PluginAuth, PluginBlocktype, etc. Most of these have exactly the same attributes as the base Plugin class, but the PluginBlocktype has the additional (NULLable) attribute of "artefact".
1. "One table per hierarchy". You store everything about all the classes, in one table. This is a bit like what we with the view_access table. For plugins, this would be one "plugin_installed" table with columns: name, plugintype, version, release, active, artefact.
2. "One table per concrete class". You have a completely separate table for each (non-abstract) class, containing all the fields of that class, inherited or otherwise. This is what we currently have! So you have tables "artefact_installed", "auth_installed", "blocktype_installed", etc. Most have the same columns, but blocktype_installed also has an "artefact" column.
3. "One table per class". In this strategy, you make a separate table for each class, including the abstract ones. You start by creating a table for the parent class, with all of the parent class's fields. Then you create a table for each subclass, and the only columns the subclass tables have, are a foreign key to the parent class (and this foreign key should double as the child class table's primary key), and any fields unique to the subclass.
So for plugins, we would have a root "plugin_installed" table with columns: name, version, release, active. Then we would have several plugintype tables with only a "name" column: artefact, auth, export, import, etc. The "blocktype" table would have columns "name" and "artefact".
4. "Generic schema". This is essentially taking database normalization to its extreme. You have a table that stores a list of classes; another table that stores attribute names and types; another field that maps attributes to classes; another table that maps values to attributes in instances of a class, etc.
This has a lot of query complexity and doesn't scale well, but it's extremely flexible. We do kind of a watered down version of this with our "config" tables.
Here's a stackexchange about it. The serious DBA people seem to like the "one table per class" approach the most, because it's the most normalized (well, except for the super-normalized generic schema.) They want you to normalize it all and let the database decide how to optimize it: http://stackoverflow.com/questions/554522/something-like-inheritance-in-database-design
Fascinatingly, Postgres actually supports table inheritance directly! http://www.postgresql.org/docs/8.3/static/ddl-inherit.html
It's also worth noting that all these plans can be supplemented by views. For instance, the "one table per concrete class" approach splits data among several tables. But we could create a view called "v_plugins_install" that stores a query to UNION together the data from all those tables, in order to make read operations easier.
Another stackexchange article. Apparently in DB circles this is known as the "gen-spec" design pattern, for "generalization/specialization". Again they seem to favor the "One Table Per Class" approach. An interesting point is that, rather than having an explicit "type" column, you use a record's presence in a particular table, to indicate its type. http://stackoverflow.com/questions/4361381/how-do-we-implement-an-is-a-relationship
Another one, which describes some of the patterns we're currently using. It's specifically focusing on "polymorphic columns", which are when one column is a foreign key to several other tables (like the Moodle context table's "id" column).
It appears the reason this is bad, is because you are essentially using a database column (context_type) to represent a database table. And using data columns to reference metadata is not part of the design of SQL, so it tends to cause problems. Most notably, you can't have a foreign key constraint on a polymorphic column (something no one cares about at Moodle because they don't use DB-side foreign key constraints... perhaps because of the context table.)
Some of the alternatives to polymorphic columns:
- Our current multi-column "owner" solution is the "Exclusive Arcs" strategy. You have a separate NULLable foreign key column for each other table.
- Again there's the "One Table Per Class" approach, but this time called "Concrete Supertable".
- And third, the "Reverse the relationship" strategy, where you have a separate many-to-many mapping table for each table that would otherwise store an ID in the polymorphic column.
- So for instance, in the "view" table, we'd drop the "owner", "group", and "institution" columns. Then we'd create tables "group_owns_view" (id, groupid, viewid), "user_owns_view" (id, userid, viewid), and "institution_owns_view" (id, groupid, institutionid).
- Although you know... I think the problem there is that there's no SQL constraint that would let you enforce that each view MUST be owned by at least one entity, with that system.
- Unless you combined it with the "One Table Per Class" strategy. Have a "viewowner" table, that is just a sequence of IDs. Then have "*_owns_view" use a viewowner ID as their primary/foreign key, and view.owner becomes a foreign key to viewowner.id.
I dunno... the problem is, all these elegant normalized database structures are elegant and normalized, but they also add a lot more complexity, a lot more inserts and updates and deletions, and they make the database structure more puzzling for new developers.
We'll have to keep in mind, a simple database structure which can be comprehended easily by the uninitiated, is a good thing.