Wed
08
Feb 2012
I'm not SQL guru, but I have to use databases from time to time, whether at home or at work, for desktop or (mostly) web applications. Some time ago I discovered a trick which greatly optimizes performance of complex queries that contain subqueries. I used it again recently when coding new comments administration panel for this website.
In my previous job I had a case when after several months of work my system gathered so much data that it couldn't generate report in any reasonable time. After applying this optimization reports from same data were generated instantly.
I know for these who deal with databases every day this "trick" or "optimization" is probably very basic, but if you - just like me - do only simple database tasks, it can be new to you. Here are the details:
Let's say we code a blog CMS and we have two tables:
CREATE TABLE posts ( post_id int(11) NOT NULL AUTO_INCREMENT, title varchar(128) NOT NULL, content text NOT NULL, time datetime NOT NULL, PRIMARY KEY (post_id), KEY time_k (time) ); CREATE TABLE comments ( comment_id int(11) NOT NULL AUTO_INCREMENT, post_id int(11) NOT NULL, author varchar(32) NOT NULL, text text NOT NULL, time datetime NOT NULL, PRIMARY KEY (comment_id), KEY post_id_fk (post_id), CONSTRAINT post_id_fk FOREIGN KEY (post_id) REFERENCES posts (post_id) ON DELETE CASCADE ON UPDATE CASCADE );
We want to show all comments (regardless of referred post), from newest to oldest.
select comment_id, post_id, author, text, time from comments order by time desc;
Now we want to show also some information about the post each comment refers to. We have to join table comments with table posts.
select c.comment_id comment_id, c.post_id post_id, c.author comment_author, c.text comment_text, c.time comment_time, p.title post_title, p.time post_time from comments c join posts p on p.post_id = c.post_id order by c.time desc;
But then things get complicated as we want some more information about each post: number of comments, time of its first and last comment. Such information for a single post can be fetched with following query:
select count(*) count, min(time) first_time, max(time) last_time from comments where post_id = 1;
Now, how to join this with the query above? First idea is to put subqueries in "select" section. That will fetch desired information for each record:
select c.comment_id comment_id, c.post_id post_id, c.author comment_author, c.text comment_text, c.time comment_time, p.title post_title, p.time post_time, ( select count(*) from comments c2 where c2.post_id = p.post_id ) comment_count, ( select min(time) first_comment_time from comments c2 where c2.post_id = p.post_id ) first_comment_time, ( select max(time) last_comment_time from comments c2 where c2.post_id = p.post_id ) last_comment_time from comments c join posts p on p.post_id = c.post_id order by c.time desc;
This is the great killer of the performance. I've found that such subqueries in select section, which logically issue a new query for each returned row, execute very slowly in MySQL. It means the database probably doesn't optimize it as well as it could.
But we can optimize it manually to retrieve exactly same data much faster. The trick involves writing a subquery in "from" section instead of "select" section. This way the subquery will be logically executed only once and joined with the rest of the query like it was just another database table.
To implement this idea, we have to first code a query that will return same data like our subqueries above, but this time for all posts we are interested in, not for a single one:
select p.post_id post_id, count(*) comment_count, min(c.time) first_comment_time, max(c.time) last_comment_time from comments c join posts p on p.post_id = c.post_id group by c.post_id;
The final query is:
select c.comment_id comment_id, c.post_id post_id, c.author comment_author, c.text comment_text, c.time comment_time, p.title post_title, p.time post_time, sub.comment_count comment_count, sub.first_comment_time first_comment_time, sub.last_comment_time last_comment_time from comments c join posts p on p.post_id = c.post_id join ( select p.post_id post_id, count(*) comment_count, min(c.time) first_comment_time, max(c.time) last_comment_time from comments c join posts p on p.post_id = c.post_id group by c.post_id ) sub on sub.post_id = p.post_id order by c.time desc;