Erlang Answers

A Companion to Erlang Questions

EQ Archives

Authentication

Mnesia Ordered By Queries

Table of Contents

Introduction

Mnesia is not a relational database engine; it's better described as a distributed transactional key-value store. Relational databases do not have their feature set by accident, however, so periodically somebody asks the list about range queries:

Hi,

What's the simplest, most efficient way of expressing the following
sql query in mnesia:

"SELECT * FROM foo ORDER BY creation_date LIMIT 500,10"

?

Btw, I don't care about transactions.

Thanks,
Yariv

Asking for simultaneous simplicity and efficiency is certainly optimistic.

As with many complicated Mnesia problems, the solution involves doing manually what relational databases do automatically behind the scenes.  Thus, if your application mostly consists of complicated relational access patterns, you should consider another database engine rather than Mnesia.

Inefficient General Solution

An inefficient general solution consists of doing a full table scan, accumulating the records into a tree, and reading out the result at the end of the scan.  This is what a relational database engine will do for you behind the scenes assuming you have not indexed the column being ordered.  If the number of records in the table is not that large and the query infrequent this can be an acceptable solution.

The mnesia:select/4 function can be used to extract records from a table according to a match specification.  Match specifications are quite general; in this case, because all records are eligible (i.e., no WHERE clauses in the SQL) we will just select all records, but we will select 10 at a time to prevent exhausting available memory.

-module (testsort).
-export ([ limit/4 ]).

limit (Tab, Field, Offset, Number) ->
  limit (get_field_number (Tab, Field),
         Offset,
         Number,
         mnesia:select (Tab,
                        [ { mnesia:table_info (Tab, wild_pattern), [], [ '$_' ] } ],
                        10,
                        read),
         gb_sets:empty ()).

get_offset ([ Field | _ ], Field, N) -> N;
get_offset ([ _ | T ], Field, N) -> get_offset (T, Field, N + 1).

get_field_number (Tab, Field) ->
  get_offset (mnesia:table_info (Tab, attributes), Field, 2).

limit (_FieldNumber, Offset, Number, '$end_of_table', Tree) ->
  Candidates = [ Record || { _, Record } <- gb_sets:to_list (Tree) ],
  { _, Rest } = safe_split (Offset, Candidates),
  { Result, _ } = safe_split (Number, Rest),
  Result;
limit (FieldNumber, Offset, Number, { Results, Cont }, Tree) ->
  NewTree =
    lists:foldl (fun (Record, AccTree) ->
                   Key = { element (FieldNumber, Record), Record },
                   gb_sets:add (Key, AccTree)
                 end,
                 Tree,
                 Results),

  PrunedTree = prune_tree (NewTree, Offset + Number),

  limit (FieldNumber,
         Offset,
         Number,
         mnesia:select (Cont),
         PrunedTree).

prune_tree (Tree, Max) ->
  case gb_sets:size (Tree) > Max of
    true ->
      { _, NewTree } = gb_sets:take_largest (Tree),
      prune_tree (NewTree, Max);
    false ->
      Tree
  end.

safe_split (N, L) when length (L) >= N ->
  lists:split (N, L);
safe_split (_N, L) ->
  { L, [] }.

How it works:

  • limit/4 is the entry point.  The SQL query in the original post would correspond to testsort:limit (foo, creation_date, 500, 10).
  • get_field_number/2 computes the column of the specified field in the specified table.
  • the initial mnesia:select/4 call says to select all records all columns 10 records at a time.  depending upon your record size you can get a favorable space-time tradeoff by increasing the number of records fetched per select.
  • the select results are inserted into a tree.  the key for the tree is a tuple whose first element is the field being sorted, so that the keys are in the desired sort order (the comparison function here is Erlang term order).
  • via prune_tree/2, the temporary tree is sized so that it contains at most (Offset + Number) entries.
  • at the end of the full table scan, the first Offset results are discarded and then the next Number results are returned.

Efficient Solution: Primary Key

Efficiency can be significantly improved if the column in question is either the primary key or a prefix of the primary key, and the table type is ordered_set.  Relational database engines recognize this condition for you automatically during query planning and choose an efficient method for computing the result similar to what we shall do.

We will again use mnesia:select/4, but this time to seek to the first desired result, and then we will read the answer directly.  If the record is defined as

-record (foo, { creation_date, something, something_else }).

then creation_date is the primary key, so we will select in table order, as in this code listing.

-module (testsort2).
-export ([ limit/3 ]).

limit (Tab, Offset, Number) ->
  seek (Offset,
        Number,
        mnesia:select (Tab,
                       [ { mnesia:table_info (Tab, wild_pattern), [], [ '$_' ] } ],
                       10,
                       read)).

seek (_Offset, _Number, '$end_of_table') ->
  [];
seek (Offset, Number, X) when Offset =< 0 ->
  read (Number, X, []);
seek (Offset, Number, { Results, Cont }) ->
  NumResults = length (Results),
  case Offset > NumResults of
    true ->
      seek (Offset - NumResults, Number, mnesia:select (Cont));
    false ->
      { _, DontDrop } = lists:split (Offset, Results),
      Keep = lists:sublist (DontDrop, Number),
      read (Number - length (Keep), mnesia:select (Cont), [ Keep ])
  end.

read (Number, _, Acc) when Number =< 0 ->
  lists:foldl (fun erlang:'++'/2, [], Acc);
read (_Number, '$end_of_table', Acc) ->
  lists:foldl (fun erlang:'++'/2, [], Acc);
read (Number, { Results, Cont }, Acc) ->
  NumResults = length (Results),
  case Number > NumResults of
    true ->
      read (Number - NumResults, mnesia:select (Cont), [ Results | Acc ]);
    false ->
      { Keep, _ } = lists:split (Number, Results),
      lists:foldl (fun erlang:'++'/2, Keep, Acc)
  end.

Here's how it works:

  • limit/3 is the entry point.  It assumes the ordering desired is by the primary key.  The SQL in the original post would become limit (foo, 500, 10).
  • the initial mnesia:select/4 call says to select all records all columns 10 records at a time.  depending upon your record size you can get a favorable space-time tradeoff by increasing the number of records fetched per select.
  • the seek/3 consumes the output of the select until the desired offset is reached.
  • the read/3 consumes the output of the select until the desired limit is reached.

We can use this routine even if creation_date is only a prefix of the primary key, i.e., if it were the leading element of a tuple or list that is used as the primary key.  That is because Erlang term order for tuples and lists will compare the first element first, and so will sort in the desired fashion.

This solution is linear in (Offset + Count).  If the equivalent SQL query is slightly modified to

SELECT * FROM foo WHERE creation_date > X ORDER BY creation_date LIMIT 10

then the initial mnesia:select/4 can be made more efficient with a guard that constrains the primary key, e.g.,

mnesia:select (foo, [ { #foo { creation_date = '$1', _ = '_' }, [ { '>', '$1', X } ], [ '$_' ] } ], 10, read).

With this mnesia:select/4 initial call, the seek/3 portion of the processing can be skipped, and only the read/3 portion need be executed.  Update: actually it looks like ets does not optimize this condition, although tcerl does.  However for an ordered_set table ets:next/2 will return the next key in the table even if the key supplied as argument is not in the table, so this could be exploited to walk an ets table starting at a particular lower bound.  Also, if the Offset in the limit query is resulting from continuing a previous query (e.g., paging through results in a search engine), then an alternate strategy is to serialize the select continuation, repair it (via ets:repair_continuation/2), and reuse via mnesia:select/1, which will seek to the next key in the table similar to ets:next/2mnesia:select/1 will outperform repeated calls to ets:next/2 as Count increases, so generally the former is preferred.

Efficient General Solution: Secondary Sorted Index

If the column being sorted is not (a prefix of) the primary key than efficient operation will require creating a secondary sorted index.  This is done for you automatically by a relational database engine when you index the column appropriately.  Mnesia's built-in indices are not sorted so you will have to do this by creating another table.  For example if the record is

-record (foo, { id, stuff, creation_date }).

then assuming that foo is of type set or ordered_set, you could make an ordered_set table with records

-record (foo_cd_index, { creation_date_id_tuple, void = [] }).

The primary keys of foo_cd_index will be tuples of the form { creation_date, id }, and we do not care about the void value but mnesia requires at least 2 columns, so we populate with nil (which has a small external format representation).  When we insert an element F into foo, we insert a corresponding element

#foo_cd_index{ creation_date_id_tuple = { F#foo.creation_date, F#foo.id } }

into the index table.  Once the secondary index has been constructed, we can use the primary key techniques from the previous section on the secondary table, extract the identifiers, and then query the primary table to extract the records (if only the list of identifiers is to be returned, the join can be skipped, yet another condition that a typical relational database engine recognizes for you under the hood).

If foo is of type bag, we cannot make an "ordered_bag" table for the index because mnesia does not support such a type; however we can achieve something similar with records

-record (foo_cd_index, { creation_date_id_stuff, void = [] }).

Essentially we have duplicated the record but reordered the fields.  When we insert an element F into foo, we insert a corresponding element

#foo_cd_index{ creation_id_stuff = { F#foo.creation_date, F#foo.id, F#foo.stuff } }

This solution is space intensive because the bag table type does not allow us to refer to an element by a subset of its attributes.  However, once the secondary index table has been created we can use the primary key solution on the secondary table to extract the desired records directly, without having to join against the primary table.

 


 Share this article

Comments


 commented Mon, 05 Oct 2009 01:34:45 GMT

Excellent and very useful.  Thank you Paul.

I was trying to find a solution to using Dojo's JsonRestStore with a webmachine server backed on mnesia.  Using your solution made paging requests on a 100,000 record data set complete in roughly 200ms rather than 4s.  My original hack was using a naive and memory-risking lists:sublist approach on the while data set.  My second attempt would have looked something like the Inefficient General Solution you posted.

Again, thank you :)


Post a Comment


You must log in to post a comment.

About Me

MeMy name is Paul Mineiro. I'm an avid user of Erlang and an avid reader of the Erlang Questions mailing list. I am available for consulting work. I use purple alot on this site because it is my daughter's favorite color.

Powered By