EF6 TPH performance on Azure

Table Per Hierarchy (TPH) is an interesting concept that allows the programmer to configure a whole hierarchy of classes to be saved to a single database table.

This is particularly interesting when you want to model entities that share some common properties and have some individual, class-specific ones, and you want to aggregate over the common properties while still treating each class differently.

The concept of keeping everything in one table is actually not a new one. It was much used in the olden days, but came to be considered bad practice for various reasons. One was space: a byte now is not what a byte was then. Another was performance: shifting ten times the data you actually need used to be a much bigger deal than it is now. But the main reason is manageability - imagine writing mappers and checks for a table with 500 attributes.

Well, times have changed and DB-first is not really "in" anymore. The focus has shifted away from DB design as tools have evolved to a stage where the programmer need not be burdened with the dirty details of each individual database's whims. You can rely on the ORM to do the actual DB work needed. Or can you?

In theory things are simple enough. The ORM has to traverse the class hierarchy and gradually build a table that includes every attribute found in the classes involved.

The following can be said about the resulting table:

  • all the properties of all the classes have to be stored in the table, so the set of columns is the union of all the column sets you'd need if you stored each class in its own table

  • most of the columns will be nullable

  • a discriminator that identifies the actual class of the instance must be defined

When plumbing (reading / writing) the entities, the ORM only needs to concern itself with the properties of the class actually being read or written.

Take an example like this:

    using System.Data.Entity;

    public class PetsContext : DbContext
    {
        public DbSet<Pet> Pets { get; set; }
    }

    public abstract class Pet
    {
        public int Id { get; set; }
        public string Name { get; set; }
        public string YearOfBirth { get; set; }
        public char Sex { get; set; }
    }

    public abstract class Mammal : Pet
    {
        public virtual int Legs { get; }
    }

    public class Dog : Mammal
    {
        public bool Sterilised { get; set; }
        public override int Legs { get { return 4; } }
    }

    public class Cat : Mammal
    {
        public bool IndoorOnly { get; set; }
        public override int Legs { get { return 4; } }
    }

    public class Bird : Pet
    {
        public bool Flies { get; set; }
    }
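
For reference, the table EF6 creates for this model looks roughly like the scaffolded migration below. This is a sketch of what Add-Migration typically produces (the class name is whatever you pass to the command, and exact defaults such as string lengths or the discriminator size may differ), but it illustrates the three points above: every property of every class becomes a column, the derived-class columns are nullable, and EF6 adds a discriminator on its own.

    using System.Data.Entity.Migrations;

    public partial class InitialCreate : DbMigration
    {
        public override void Up()
        {
            CreateTable(
                "dbo.Pets",
                c => new
                    {
                        Id = c.Int(nullable: false, identity: true),
                        Name = c.String(),
                        YearOfBirth = c.String(),
                        // Derived-class properties become nullable columns:
                        Flies = c.Boolean(),
                        IndoorOnly = c.Boolean(),
                        Sterilised = c.Boolean(),
                        // Added automatically by EF6 to tell the classes apart:
                        Discriminator = c.String(nullable: false, maxLength: 128),
                    })
                .PrimaryKey(t => t.Id);

            // Note: the char Sex property and the getter-only Legs property are not
            // mapped by EF6 at all, which is why they never show up in the SQL below.
        }

        public override void Down()
        {
            DropTable("dbo.Pets");
        }
    }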

Consider a simple query to list all pets:

var pets = context.Pets;

The resulting select statement issued to the DB looks like this:

SELECT 
    [Extent1].[Discriminator] AS [Discriminator], 
    [Extent1].[Id] AS [Id], 
    [Extent1].[Name] AS [Name], 
    [Extent1].[YearOfBirth] AS [YearOfBirth], 
    [Extent1].[Flies] AS [Flies], 
    [Extent1].[IndoorOnly] AS [IndoorOnly], 
    [Extent1].[Sterilised] AS [Sterilised]
FROM [dbo].[Pets] AS [Extent1]
WHERE [Extent1].[Discriminator] IN (N'Bird',N'Cat',N'Dog')

Everything is pretty much what you'd write yourself, except perhaps the discriminator filter, which is just good practice, since your code wouldn't know how to handle any other type anyway.
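
The discriminator column itself is created and maintained by EF6 and defaults to holding the class name. If you want control over the column name and the stored values, the fluent API offers it - a sketch along these lines, where the "PetType" name and values are just an arbitrary example:

    public class PetsContext : DbContext
    {
        public DbSet<Pet> Pets { get; set; }

        protected override void OnModelCreating(DbModelBuilder modelBuilder)
        {
            // Rename the discriminator column and pick explicit values per concrete class.
            modelBuilder.Entity<Pet>()
                .Map<Dog>(m => m.Requires("PetType").HasValue("Dog"))
                .Map<Cat>(m => m.Requires("PetType").HasValue("Cat"))
                .Map<Bird>(m => m.Requires("PetType").HasValue("Bird"));
        }
    }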

Now consider a slightly more complicated query:

var pets2 = context.Pets.Where(p =>
    ((p is Dog) && (p as Dog).Sterilised == true) ||
    ((p is Bird) && (p as Bird).Flies == false));

The generated query rapidly gets fatter:

SELECT 
    [Extent1].[Discriminator] AS [Discriminator], 
    [Extent1].[Id] AS [Id], 
    [Extent1].[Name] AS [Name], 
    [Extent1].[YearOfBirth] AS [YearOfBirth], 
    [Extent1].[Flies] AS [Flies], 
    [Extent1].[IndoorOnly] AS [IndoorOnly], 
    [Extent1].[Sterilised] AS [Sterilised]
FROM [dbo].[Pets] AS [Extent1]
WHERE ([Extent1].[Discriminator] IN (N'Bird',N'Cat',N'Dog')) 
    AND ((([Extent1].[Discriminator] = N'Dog') 
            AND (1 = (CASE WHEN ([Extent1].[Discriminator] = N'Dog') THEN [Extent1].[Sterilised] END))) 
        OR (([Extent1].[Discriminator] = N'Bird') 
            AND (0 = (CASE WHEN ([Extent1].[Discriminator] = N'Bird') THEN [Extent1].[Flies] END))))

In fact, with more complicated models things can start getting out of hand. We had a project where we needed to select data from a TPH table with 50,000 records and left join it with aggregates over three tables containing around 1 million records. It worked without problems - a bit slow, given the volume of data, but acceptable.

Until we deployed it to Azure. Then stuff really hit the fan: timeouts and resource hogging all around, with generated queries growing to unthinkable lengths. With deadlines slipping, there was nothing left to do except try indexing the data the way Azure suggested and, in the end, preloading the aggregates and doing in-memory joins like it was 1987. That helped a bit - not for the extreme cases, but enough to pull through the next couple of months while the critical code was converted to plain SQL and fully prerendered Razor pages instead of AJAX DataTables.
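
The in-memory join stopgap looked roughly like the sketch below. The entity and property names (Visits, PetId, Cost) are purely illustrative, not our actual model; the idea is simply to pull the aggregates out with one grouped query, key them by the foreign key, and stitch them onto the materialised TPH entities in memory instead of letting EF compose one giant statement.

    // Hypothetical aggregate source, not part of the Pets example above:
    //   public class Visit { public int Id { get; set; } public int PetId { get; set; } public decimal Cost { get; set; } }
    //   plus a DbSet<Visit> Visits on the context.

    // 1) Preload the aggregates once, keyed by the foreign key.
    var visitTotals = context.Visits
        .GroupBy(v => v.PetId)
        .Select(g => new { PetId = g.Key, Count = g.Count(), Total = g.Sum(v => v.Cost) })
        .ToDictionary(x => x.PetId);

    // 2) Materialise the TPH side with a plain query, then join in memory.
    var report = context.Pets
        .AsNoTracking()
        .ToList()
        .Select(p => new
        {
            Pet = p,
            VisitCount = visitTotals.ContainsKey(p.Id) ? visitTotals[p.Id].Count : 0,
            TotalCost = visitTotals.ContainsKey(p.Id) ? visitTotals[p.Id].Total : 0m
        })
        .ToList();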

We were just too deep in the project (it was running in production already) to get rid of TPH entirely.

So a word of warning. TPH in EF6 is neat. It's practical, it makes the code easier to read, easier to manage and maintain. But you need to test it thoroughly on real data before you make the final decision to use it. Really, take the most complicated case you can think of and run it. If your coffee gets cold while you wait for results, consider a more conservative approach.
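
A cheap way to run that test in EF6 is to dump the generated SQL via Database.Log and time the materialisation of your worst-case query, along these lines (the query from above stands in for whatever your nastiest real one is):

    using (var context = new PetsContext())
    {
        // Write every generated SQL statement to the console.
        context.Database.Log = Console.Write;

        var sw = System.Diagnostics.Stopwatch.StartNew();

        var worstCase = context.Pets
            .Where(p => ((p is Dog) && (p as Dog).Sterilised == true) ||
                        ((p is Bird) && (p as Bird).Flies == false))
            .ToList();   // force execution and materialisation

        sw.Stop();
        Console.WriteLine("{0} rows in {1} ms", worstCase.Count, sw.ElapsedMilliseconds);
    }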