Post Database Design
The easiest way to show our database model is to sketch it out on a napkin.
+----+
| id |--------+
+----+        |
              |
+-------------+-------------+
|        First Post         |
+---------------------------+
|       Jan-01, 2013        |
+---------------------------+
              |
+-------------+-------------+
| Welcome to THIS blog!     |
| This is your first post.  |
+---------------------------+
There is a unique key, id, followed by a variable number of "headers" (shown here as two, title and date, but there can be more), followed by a body of any data, ranging from empty to some sanely large size.
There are two other ways of drawing this:
+----+
| id |--------+
+----+        |
              |
+---------------------------+
| title: First Post         |
+---------------------------+
| date: Jan-01, 2013        |
+---------------------------+
              |
+---------------------------+
| body:                     |
+---------------------------+
| Welcome to THIS blog!     |
| This is your first post.  |
+---------------------------+
and:
+----+
| id |--------+
+----+        |
              |
+-------+-------------------+
| title | First Post        |
+-------+-------------------+
| date  | Jan-01, 2013      |
+-------+-------------------+
              |
+-------+---------------------------+
| body  | Welcome to THIS blog!     |
+-------+ This is your first post.  |
        +---------------------------+
Each post record can also be seen as an associative array:
id    => N
title => "First Post"
date  => "Jan-01, 2013"
body  => "Welcome..."
where id is an integer and all others are strings. We call this the post "definition".
The interesting thing about this is that the code is designed to adapt to the data. That is, headers can be added to (or removed from) the "header" portion at any time without breaking anything. For example, a post "definition" can be:
id       => N
title    => "First Post"
subtitle => "Automatically generated"
date     => "Jan-01, 2013"
status   => "New"
body     => "Welcome..."
and everything runs fine (with defaults for missing data in old records). If, for example, the record definition later has status removed, the status data in old records is simply ignored.
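As a sketch of that behavior (illustrative only: the names $defaults and $headers are made up, and array_merge() stands in for whatever the real code does):

// a sketch: lay a record's parsed headers over the current
// definition's defaults; $defaults and $headers are hypothetical
$defaults = array(
    'title'  => '(untitled)',
    'date'   => '',
    'status' => 'New',
);
$post = array_merge($defaults, $headers);
// an old record missing status gets the default; a stale
// header the definition no longer lists is simply never read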
Which brings us to the final look at the database table:
+----+------+
| id | body |
+----+------+
where id is an auto-increment INT and body is a MEDIUMTEXT.
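In MySQL terms, a sketch of that table (the PRIMARY KEY on id is an assumption; the text above only names the column types):

CREATE TABLE posts (
    id   INT AUTO_INCREMENT PRIMARY KEY,  -- assumed primary key
    body MEDIUMTEXT
);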
So, in actuality, the separation of body from the "headers" (title and date in these examples) is a conceptual one, as a post's body is defined as one or more colon-delimited word identifiers with text as headers, followed by a blank line, followed by the body text:
+----+---------------------------------------------------------------+
| id | title:First Post\ndate:Jan-01, 2013\n\nWelcome...\nThis is... |
+----+---------------------------------------------------------------+
which is exactly how a post is stored in the database.
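To make the format concrete, here is a sketch of how such a record might be unpacked (illustrative only; parse_post is a made-up name, not a function from the actual code):

// a sketch: split a stored post into its headers and body
function parse_post($raw)
{
    // headers and body are separated by the first blank line
    list($head, $body) = explode("\n\n", $raw, 2);
    $headers = array();
    foreach (explode("\n", $head) as $line) {
        // each header line is "name:value", split on the first colon
        list($name, $value) = explode(':', $line, 2);
        $headers[$name] = $value;
    }
    return array('headers' => $headers, 'body' => $body);
}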
Although this design makes for really simple and flexible header data, it does not allow for direct queries on that data. Having each header as its own column in the table would allow queries on those headers, but that was not a requirement of our design.
We cannot query our database on header data directly: each record must be read in full and its headers extracted before they can be examined. For a small blog of no more than several thousand medium-sized posts, as ours was designed to be, this is not really an issue, and it keeps the overall code base very small. However, with tens of thousands of posts or more, and with much larger post data, the performance hit will be felt.
To allow for queries on header data — while maintaining the basic flexibility we have outlined here — it might be interesting to change the database to:
+----+---------+------+
| id | headers | body |
+----+---------+------+
which would look like:
+----+-------------------------------------+------------------------+
| id | title:First Post\ndate:Jan-01, 2013 | Welcome...\nThis is... |
+----+-------------------------------------+------------------------+
This still does not allow for queries on single header fields, but with a very small code base and a modest number of posts (fewer than tens of thousands), that would not be an issue.
There is one more step that would offer an improvement: adding two columns for the post title and date (while leaving room for a variable number of "sub-headers"), as all posts require title and date headers.
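That table might look something like:

+----+-------+------+---------+------+
| id | title | date | headers | body |
+----+-------+------+---------+------+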
Note that this was designed as a single-user website, so post records do not have a "name" header, and no header field beyond title and date lends itself to queries.
Comments Database
The user comments database is just like the posts database, with the addition of an entryid column, which contains the id of the post that the comment is for. It has a "comment definition" of:
id      => N
entryid => N
subject => "What the?"
from    => "Visitor Joe"
date    => "Jan-30, 2013"
body    => "This is just crazy!"
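Fetching a post's comments is then a single select on that column (a sketch, assuming a mysqli handle in $db, a comments table name, and the parse_post() sketch from above):

// a sketch: read all comments attached to one post
$comments = array();
$result   = $db->query("SELECT body FROM comments WHERE entryid = " . (int)$postid);
while ($row = $result->fetch_assoc())
    $comments[] = parse_post($row['body']);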
Visitors Database
The Visitors database is similar to the POST DATABASE. (See file DOC/VISITORS.)
This is what we call a "Visitor" record:
+---------+
| userid  |---+
+---------+   |
              |
+-------------+-------------+
|           Jihan           |
+---------------------------+
|        pickleberry        |
+---------------------------+
              |
+-------------+-------------+
| I am a great person.      |
| I love lists and b-trees. |
| I am 5 years old.         |
+---------------------------+
The other two ways of drawing this are:
+---------+
| userid  |---+
+---------+   |
              |
+-------------+-------------+
| name: Jihan               |
+---------------------------+
| password: pickleberry     |
+---------------------------+
              |
+-------------+-------------+
| body:                     |
+---------------------------+
| I am a great person.      |
| I love lists and b-trees. |
| I am 5 years old.         |
+---------------------------+
and:
+---------+
| userid  |---+
+---------+   |
              |
+----------+----------------+
| name     | Jihan          |
+----------+----------------+
| password | pickleberry    |
+----------+----------------+
              |
+------+---------------------------+
| body | I am a great person.      |
+------+ I love lists and b-trees. |
       | I am 5 years old.         |
       +---------------------------+
Each record has a "definition" in the form of name/value pairs:
userid   => unique
name     => string
password => crypt
body     => data
where each value can be considered the name of a "function" that further defines the type of data that can be stored. Borrowing from SQL and regular expressions, the values can also be looked at this way:
userid   => AUTO_INCREMENT
name     => [a-zA-Z0-9_ \.]
password => crypt([a-zA-Z0-9])
body     => .*
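A sketch of how that view of the definition might drive validation (illustrative only: validate_visitor and $salt are made-up names; the patterns come from the table above):

// a sketch: check a record against the visitor definition
function validate_visitor($record, $salt)
{
    if (!preg_match('/^[a-zA-Z0-9_ .]+$/', $record['name']))
        return false;
    if (!preg_match('/^[a-zA-Z0-9]+$/', $record['password']))
        return false;
    // store the crypt()ed form, never the plain password
    $record['password'] = crypt($record['password'], $salt);
    return $record;
}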
Performance
Left unmentioned in the design is that we wanted the database code to be "Data Ignorant". All the code does is read/write/delete records. The code is just an API and knows nothing about the content of the records, just like the fopen/fread/fwrite/fclose API. It is the "higher up" code that knows, or cares, about the content.
About performance: say we have 2,000 records, and we want to read only those that have a "type" column of "page". Here is a query string:
SELECT body FROM 'posts' WHERE type = $page
What we have to do, using our current code as is, is something like this:
SELECT id FROM 'posts'
foreach (id)
    SELECT body FROM 'posts' WHERE id = $id
    if (body[type] == $page)
        $records[] = body
for all posts, and at the end we have an array, $records, with each post of type $page. Looks quite wasteful!
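In real code that loop might look something like this (a sketch, assuming a mysqli handle in $db and the parse_post() sketch from above):

// a sketch: read every post in full, keep those whose
// "type" header matches $page
$records = array();
$result  = $db->query("SELECT id FROM posts");
while ($row = $result->fetch_assoc()) {
    $body = $db->query("SELECT body FROM posts WHERE id = " . (int)$row['id'])
               ->fetch_assoc();
    $post = parse_post($body['body']);
    if (isset($post['headers']['type'])
            && $post['headers']['type'] == $page)
        $records[] = $post;
}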
But just what goes on behind the scenes within MySQL? That is one for the ages. Because, what if MySQL, during its "SELECT body FROM posts WHERE type = $page", had to do something like this:
SELECT id FROM 'posts'
foreach (id)
    SELECT body FROM 'posts' WHERE id = $id
    if (body[type] == $page)
        $records[] = body
MySQL, like all databases, stores its own data as files, or within files. When any code reads a record from a file, it can open the file and read just, say, the first 1024 bytes. So let's look at it this way, with the assumption that the "columns" we want to query on are in that first 1024-byte block:
foreach record file
    read 1024 bytes as header
    if header has type = $page
        $records[] = read entire record
Which looks at least somewhat more efficient than reading the entire record file each time for all records.
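As a sketch of that idea with plain files (illustrative only: one file per record under a made-up posts/ directory, headers assumed to fit in the first 1024 bytes, and a real version would parse the header lines properly instead of a substring match):

// a sketch: scan only the header block of each record file
$records = array();
foreach (glob('posts/*') as $file) {
    $fp   = fopen($file, 'r');
    $head = fread($fp, 1024);
    if (strpos($head, "type:$page\n") !== false)
        $records[] = $head . stream_get_contents($fp); // rest of the file
    fclose($fp);
}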
But is that the way MySQL works? What does MySQL do if you add a "header column"? Does it reorganize and optimize each record for such reading? Perhaps, as there are indexes and all sorts of other stuff available. But I need to know more before I start using hundreds of awful, intertwined, code-bloated query strings to get around the bottleneck that is the filesystem.