Actions

Difference between revisions of "Developer Area/Database design strategies"

From Mahara Wiki

< Developer Area
(Created page with "'''Work in progress''' I've created this page to try to consolidate Mahara's stock of database design strategies, when we have data structures that are tricky to map into a s...")
 
Line 6: Line 6:
  
 
=== Known trouble spots ===
 
=== Known trouble spots ===
 +
This section describes existing database structures in Mahara that are annoying to deal with.
 +
 +
==== Plugin tables (_installed, _cron, _event_subscriptions)====
 +
 +
'''Situation:''' Every Mahara plugin type has a set of standard tables called "<plugintype>_installed", "<plugintype>_config", "<plugintype>_cron", and "<plugintype>_event_subscriptions". Additionally, there are "core" tables that parallel some of these: "config", "cron", and "event_subscriptions".
 +
 +
'''The problem:''' It's annoying to try to read from all these tables. Whenev we want to look for event handlers, we have to retrieve the list of every plugin type, and then query all the plugin tables, and the core table.
 +
 +
'''Cause:''' We're trying to map the Plugin class hierarchy into a relational database. Relational databases don't natively handle inheritance.
 +
 +
==== Owners (usr, group, institution, system) ====
 +
 +
'''Situation:''' Several types of things in Mahara can be "owned" by some agent (views, collections, institutions). At first only users could own things, so the oldest tables just have an "owner" column, which is a foreign key to "usr.id". Then we added support for groups to own things, so we added a "group" column, which is a foreign key to "group.id", and made that and the "owner" column NULLable. Then we expanded to institutions owning things, so we added a third column, "institution", that maps to either "institution.id" or "institution.name" (if it's an older table).
 +
 +
'''The problem:''' Having three NULLable foreign keys makes it tricky to set up uniqueness constraints in a database cross-compatible way. (For instance, we need a uniqueness constraint on the view table for (owner, group, institution, urlid), but MySQL can't handle that three of those can be NULL). Also, queries (such as the one used by view_search) become complex, requiring outer joins to multiple tables, and sometimes CASE statements to equate things like "group.name", "institution.displayname", etc.
 +
 +
It's also conceivable that we might want to expand to ownership by other sorts of entities in the future, which is difficult to do with this structure.
 +
 +
'''The cause:''' I believe this is an example of polymorphism. In object-oriented terms, "view", "collection", and "institution" are all implementing a shared interface (call it "owner"). Again, not something that Relational databases can natively handle.
 +
 +
==== Viewers (view_access) ====
 +
 +
'''Situation:''' Similar to the "owners" situation, there are several classes of agents that can be the recipient of a permission stored in the "view_access" table: individual users, individual groups, "friends" of the page owner, all logged-in users, everyone using a specific "secret url", and everyone in the world. We've implemented this by having a LOT of different columns in the "view_access" table: accesstype, group, role, usr, institution, and token.
 +
 +
'''The problem:''' It makes view permissions queries very complex, and it's difficult to make changes to the permission scheme.
  
 
=== Suggested strategies ===
 
=== Suggested strategies ===
Line 15: Line 40:
 
You should read the essay to fully understand what they mean, but I'll give an example, using a simplified version of our own Plugin hierarchy.
 
You should read the essay to fully understand what they mean, but I'll give an example, using a simplified version of our own Plugin hierarchy.
  
In PHP, we have a base "Plugin" class, and each plugin will have these attributes stored in the database: name, version, release, active. We then have several subclasses: PluginArtefact, PluginAuth, PluginBlocktype, etc. Most of these have exactly the same attributes as the base Plugin class, but the PluginBlocktype has the additional (NULLable) attribute of "artefact".
+
In PHP, we have a base abstract "Plugin" class, and each plugin will have these attributes stored in the database: name, version, release, active. We then have several subclasses: PluginArtefact, PluginAuth, PluginBlocktype, etc. Most of these have exactly the same attributes as the base Plugin class, but the PluginBlocktype has the additional (NULLable) attribute of "artefact".
 +
 
 +
1. "One table per hierarchy". You store everything about all the classes, in one table. This is a bit like what we with the view_access table. For plugins, this would be one "plugin_installed" table with columns: name, plugintype, version, release, active, artefact.
 +
 
 +
2. "One table per concrete class". You have a completely separate table for each (non-abstract) class, containing all the fields of that class, inherited or otherwise. This is what we currently have! So you have tables "artefact_installed", "auth_installed", "blocktype_installed", etc. Most have the same columns, but blocktype_installed also has an "artefact" column.
 +
 
 +
3. "One table per class". In this strategy, you make a separate table for each class, including the abstract ones. You start by creating a table for the parent class, with all of the parent class's fields. Then you create a table for each subclass, and the only columns the subclass tables have, are a foreign key to the parent class (and this foreign key should double as the child class table's primary key), and any fields unique to the subclass.
 +
 
 +
So for plugins, we would have a root "plugin_installed" table with columns: name, plugintype, version, release, active. Then we would have several plugintype tables with only a "name" column: artefact, auth, export, import, etc. The "blocktype" table would have columns "name" and "artefact".
 +
 
 +
4. "Generic schema". This is essentially taking database normalization to its extreme. You have a table that stores a list of classes; another table that stores attribute names and types; another field that maps attributes to classes; another table that maps values to attributes in instances of a class, etc.
  
1. "One table per hierarchy". You store everything about all the classes, in one table
+
This has a lot of query complexity and doesn't scale well, but it's extremely flexible. We do kind of a watered down version of this with our "config" tables.

Revision as of 00:56, 4 December 2015

Work in progress

I've created this page to try to consolidate Mahara's stock of database design strategies, when we have data structures that are tricky to map into a standard relational database.

See https://bugs.launchpad.net/mahara/+bug/1522309

Known trouble spots

This section describes existing database structures in Mahara that are annoying to deal with.

Plugin tables (_installed, _cron, _event_subscriptions)

Situation: Every Mahara plugin type has a set of standard tables called "<plugintype>_installed", "<plugintype>_config", "<plugintype>_cron", and "<plugintype>_event_subscriptions". Additionally, there are "core" tables that parallel some of these: "config", "cron", and "event_subscriptions".

The problem: It's annoying to try to read from all these tables. Whenev we want to look for event handlers, we have to retrieve the list of every plugin type, and then query all the plugin tables, and the core table.

Cause: We're trying to map the Plugin class hierarchy into a relational database. Relational databases don't natively handle inheritance.

Owners (usr, group, institution, system)

Situation: Several types of things in Mahara can be "owned" by some agent (views, collections, institutions). At first only users could own things, so the oldest tables just have an "owner" column, which is a foreign key to "usr.id". Then we added support for groups to own things, so we added a "group" column, which is a foreign key to "group.id", and made that and the "owner" column NULLable. Then we expanded to institutions owning things, so we added a third column, "institution", that maps to either "institution.id" or "institution.name" (if it's an older table).

The problem: Having three NULLable foreign keys makes it tricky to set up uniqueness constraints in a database cross-compatible way. (For instance, we need a uniqueness constraint on the view table for (owner, group, institution, urlid), but MySQL can't handle that three of those can be NULL). Also, queries (such as the one used by view_search) become complex, requiring outer joins to multiple tables, and sometimes CASE statements to equate things like "group.name", "institution.displayname", etc.

It's also conceivable that we might want to expand to ownership by other sorts of entities in the future, which is difficult to do with this structure.

The cause: I believe this is an example of polymorphism. In object-oriented terms, "view", "collection", and "institution" are all implementing a shared interface (call it "owner"). Again, not something that Relational databases can natively handle.

Viewers (view_access)

Situation: Similar to the "owners" situation, there are several classes of agents that can be the recipient of a permission stored in the "view_access" table: individual users, individual groups, "friends" of the page owner, all logged-in users, everyone using a specific "secret url", and everyone in the world. We've implemented this by having a LOT of different columns in the "view_access" table: accesstype, group, role, usr, institution, and token.

The problem: It makes view permissions queries very complex, and it's difficult to make changes to the permission scheme.

Suggested strategies

Here's an essay the describes four well-known strategies for dealing with inheritance in the database, and has a table describing their advantages and disadvantages and suggesting when to use them: http://www.agiledata.org/essays/mappingObjects.html#ComparingTheStrategies

This source calls them "One table per hierarchy", "One table per concrete class", "one table per class", and "generic schema".

You should read the essay to fully understand what they mean, but I'll give an example, using a simplified version of our own Plugin hierarchy.

In PHP, we have a base abstract "Plugin" class, and each plugin will have these attributes stored in the database: name, version, release, active. We then have several subclasses: PluginArtefact, PluginAuth, PluginBlocktype, etc. Most of these have exactly the same attributes as the base Plugin class, but the PluginBlocktype has the additional (NULLable) attribute of "artefact".

1. "One table per hierarchy". You store everything about all the classes, in one table. This is a bit like what we with the view_access table. For plugins, this would be one "plugin_installed" table with columns: name, plugintype, version, release, active, artefact.

2. "One table per concrete class". You have a completely separate table for each (non-abstract) class, containing all the fields of that class, inherited or otherwise. This is what we currently have! So you have tables "artefact_installed", "auth_installed", "blocktype_installed", etc. Most have the same columns, but blocktype_installed also has an "artefact" column.

3. "One table per class". In this strategy, you make a separate table for each class, including the abstract ones. You start by creating a table for the parent class, with all of the parent class's fields. Then you create a table for each subclass, and the only columns the subclass tables have, are a foreign key to the parent class (and this foreign key should double as the child class table's primary key), and any fields unique to the subclass.

So for plugins, we would have a root "plugin_installed" table with columns: name, plugintype, version, release, active. Then we would have several plugintype tables with only a "name" column: artefact, auth, export, import, etc. The "blocktype" table would have columns "name" and "artefact".

4. "Generic schema". This is essentially taking database normalization to its extreme. You have a table that stores a list of classes; another table that stores attribute names and types; another field that maps attributes to classes; another table that maps values to attributes in instances of a class, etc.

This has a lot of query complexity and doesn't scale well, but it's extremely flexible. We do kind of a watered down version of this with our "config" tables.