index

Database Design

Post Database

The easiest way to show our database model is to sketch it out on a napkin.

        +----+
        | id |--------+
        +----+        |
                      |
        +-------------+-------------+
        | First Post                |
        +---------------------------+
        | Jan-01, 2013              |
        +---------------------------+
                      |
        +-------------+-------------+
        | Welcome to THIS blog!     |
        | This is your first post.  |
        +---------------------------+

There is a unique key, id, there follows a variable number of "headers" (shown here as two, title and date, but there can be more), followed by a body of any data that can range from being empty to some sanely large number.

There are two other ways of drawing this:

        +----+
        | id |--------+
        +----+        |
                      |
        +---------------------------+
        | title: First Post         |
        +---------------------------+
        | date: Jan-01, 2013        |
        +---------------------------+
                      |
        +---------------------------+
        | body:                     |
        +---------------------------+
        | Welcome to THIS blog!     |
        | This is your first post.  |
        +---------------------------+

and:

        +----+
        | id |--------+
        +----+        |
                      |
        +-------+-------------------+
        | title | First Post        |
        +-------+-------------------+
        | date  | Jan-01, 2013      |
        +-------+-------------------+
                      |
    +-------+---------------------------+
    | body  | Welcome to THIS blog!     |
    +-------+ This is your first post.  |
            +---------------------------+

Each post record can also be seen as an associated array:

        id	=> N
        title   => "First Post"
        date  	=> "Jan-01, 2013"
        body    => "Welcome..."

where id is an integer and all others are strings. We call this the post "definition".

The interesting thing about this is that the code is designed to adapt to the data. That is, the "header" portion can be added to at anytime (or can be removed of) without breaking anything. That is, a post "definition" can be:

        id	=> N
        title   => "First Post"
        subtile => "Automatically generated"
        date  	=> "Jan-01, 2013"
        status  => "New"
        body    => "Welcome..."

and everything runs fine (with defaults for missing data in old records). If the record definition changes to having the status removed for example, the status data of old records would simply be ignored.

Which brings us to the final look at the database table:

    +----+------+
    | id | body |
    +----+------+

where id is an auto-increment INT and body is a MEDIUMTEXT.

So, in actuality, the separating of body from the "headers" (title and date in these examples) is actually a conceptual one, as a post's body is defined as: one or more colon delimited word identifiers and text as a header, followed by a blank line followed by body text:

    +----+---------------------------------------------------------------+
    | id | title:First Post\ndate:Jan-01, 2013\n\nWelcome...\nThis is... |
    +----+---------------------------------------------------------------+

which is exactly how a post is stored in the database.

Although this design makes for really simple and flexible header data, it does not allow for direct queries of header data. Having all the headers each as a column in a table allows for queries on those headers — but that was not a requirement of our design.

Our database code knows nothing about format of the post data.

We cannot query our database on header data directly, as each record must be read in full, the headers extracted and then queried. With a small blog, with no more than several thousands of medium sized posts — as ours was designed to be — this is not really an issue because the overall code base is very small. However, with tens of thousands or more and with much larger post data sizes, the performance hit will be seen.

To allow for queries on header data — while maintaining the basic flexibility we have outlined here — it might be interesting to change the database to:

    +--------+---------+------+
    | userid | headers | body |
    +--------+---------+------+

which would look like:

    +----+-------------------------------------+------------------------+
    | id | title:First Post\ndate:Jan-01, 2013 | Welcome...\nThis is... |
    +----+-------------------------------------+------------------------+

This still does not allow for queries on single header fields, but still, with a very small code base, and with modest amounts of posts (less than tens of thousands), that would not be an issue.

There is one more step which would offer an improvement: the adding of two columns for the post title and date (while leaving room for a variable number of "sub-headers"), as all posts require a date and title header.

But that increases code complexity with negligible performance improvement.

Note that THIS was designed to be a single user website and post records do not have a "name" header, and no other header fields beyond title and date lends itself to queries.

Comments Database

The user comments database is just like the posts database with the addition on a column of entryid, which contains the id of the post that the comment is for. With a "comment definition" of:

        id	=> N
        entryid => N
        subject	=> "What the?"
        from	=> "Visitor Joe"
        date	=> "Jan-30, 2013"
        body	=> "This is just crazy!"

Visitors Database

The Visitors database is similar to the POST DATABASE. (See file VISITORS.)

This is what we call a "Visitor" record:

        +---------+
        | userid  |-------+
        +---------+       |
                          |
            +-------------+-------------+
            | Jihan                     |
            +---------------------------+
            | pickleberry               |
            +---------------------------+
                          |
            +-------------+-------------+
            | I am a great person.      |
            | I love lists and b-trees. |
            | I am 5 years old.         |
            +---------------------------+

The other two ways of drawing this are:

        +---------+
        | userid  |-------+
        +---------+       |
                          |
            +-------------+-------------+
            | name: Jihan               |
            +---------------------------+
            | pasword: pickleberry      |
            +---------------------------+
                          |
            +-------------+-------------+
            | body:                     |
            +-------------+-------------+
            | I am a great person.      |
            | I love lists and b-trees. |
            | I am 5 years old.         |
            +---------------------------+

and:

        +---------+
        | userid  |-------+
        +---------+       |
                          |
            +---------+-----------------+
            | name    | Jihan           |
            +---------+-----------------+
            | pasword | pickleberry     |
            +---------+-----------------+
                          |
         +------+--------------------------+
         | body | I am a great person.     |
         +------+ I love lists and b-trees.|
                | I am 5 years old.        |
                +--------------------------+

As each record has a "definition" in the form of name/value pairs:

        userid    => unique
        name      => string
        password  => crypt
        body      => data

where the value can be considered the name of a "function" that further defines the type of data that the can be stored. With borrowing from SQL and Regular Expressions, the values can also be looked at this way:

        userid    => AUTO_INCREMENT
        name      => [a-zA-Z0-9_ \.]
        password  => crypt([a-zA-Z0-9])
        body      => .*

Performance

We designed the database code to be "Data Ignorant". All the code does is read/write/delete records. The code is just an API and knows nothing about the content of the records. Just like the fopen/fread/fwrite/fclose API. It is the "higher up" code that knows, or cares, about the content.

About performance. Say we have 2,000 records. And we want to read only those that have a "title" header of "Title". Here is a query string:

        SELECT body FROM 'posts' WHERE title = $title

What we have to do, using our current code as is, is something like this:

        SELECT id FROM 'posts'
        foreach (id)
        	SELECT body FROM 'posts' WHERE id = $id
        	if (body['title'] == $title)
        		$records[] = body

We have to read all posts and check the title of each one, at the end we have an array, $records, with each post with that title. Looks quite wasteful!

But just what goes on behind the scene within MySQL? That is one for the ages. Because, what if MySQL, during it's "SELECT body FROM posts WHERE title = $title", had to do something like this:

        SELECT id FROM 'posts'
        foreach (id)
        	SELECT body FROM 'posts' WHERE id = $id
        	if (body['title'] == $title)
        		$records[] = body

MySQL, as all databases, stores it's own data as files, or within files. So, when performing the query for a "title" of "Title", does not MySQL have to read all posts and check the title of each one?

I need to know more before I will start using hundreds awful and intertwined and code bloated query strings to get around the bottleneck that is the filesystem.