When working with versioned records in a SQL database, it's common to need a query that retrieves only the latest version of each record. Suppose we have a table called post_revisions with the following structure:

CREATE TABLE post_revisions (
    id INT AUTO_INCREMENT PRIMARY KEY,
    post_id INT NOT NULL,
    revision_number INT NOT NULL,
    UNIQUE KEY (post_id, revision_number)
);

Each post_id can have multiple revisions, and we want to retrieve only the id of the record with the highest revision_number for each post_id.

Using a Subquery with JOIN

A common and usually efficient way to do this is to join the table against a derived table (a subquery in the FROM clause) that holds the highest revision_number for each post_id:

SELECT pr.id
FROM post_revisions pr
JOIN (
    SELECT post_id, MAX(revision_number) AS max_version
    FROM post_revisions
    GROUP BY post_id
) latest ON pr.post_id = latest.post_id AND pr.revision_number = latest.max_version;

Explanation:

  1. The subquery (latest) retrieves the highest revision_number for each post_id.
  2. We then JOIN this result back to the post_revisions table on both post_id and revision_number to get the corresponding id.
  3. Because (post_id, revision_number) is unique, each post matches exactly one row, so the result contains exactly one id per post: the id of its latest revision.
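The steps above can be tried end to end with a small script. This is a minimal sketch, not a definitive implementation: it assumes SQLite (through Python's built-in sqlite3 module) in place of MySQL, so the table definition is adapted to SQLite syntax, and the sample rows and the helper names make_demo_db and latest_ids_join are hypothetical.

```python
import sqlite3

# Hypothetical sample data: (post_id, revision_number) pairs.
SAMPLE_ROWS = [(1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2)]


def make_demo_db():
    """Create an in-memory SQLite copy of the table (SQLite dialect, an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE post_revisions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            post_id INT NOT NULL,
            revision_number INT NOT NULL,
            UNIQUE (post_id, revision_number)
        )
    """)
    conn.executemany(
        "INSERT INTO post_revisions (post_id, revision_number) VALUES (?, ?)",
        SAMPLE_ROWS,
    )
    return conn


def latest_ids_join(conn):
    """Return the id of the newest revision of each post using the JOIN query."""
    query = """
        SELECT pr.id
        FROM post_revisions pr
        JOIN (
            SELECT post_id, MAX(revision_number) AS max_version
            FROM post_revisions
            GROUP BY post_id
        ) latest
          ON pr.post_id = latest.post_id
         AND pr.revision_number = latest.max_version
        ORDER BY pr.id
    """
    return [row[0] for row in conn.execute(query)]


conn = make_demo_db()
print(latest_ids_join(conn))  # [3, 4, 6]
```

The rows receive ids 1 through 6 in insertion order, so the latest revisions of posts 1, 2, and 3 are the rows with ids 3, 4, and 6.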

Alternative: Using a WHERE Clause with a Correlated Subquery

Another approach is to use a correlated subquery in the WHERE clause:

SELECT pr.id
FROM post_revisions pr
WHERE pr.revision_number = (
    SELECT MAX(inner_pr.revision_number)
    FROM post_revisions inner_pr
    WHERE inner_pr.post_id = pr.post_id
);

Comparison:

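  • The JOIN approach computes the per-post maximum once, in a single grouped pass, so it tends to scale well on large tables.
  • The WHERE clause approach is easier to read, but the subquery is conceptually evaluated once per outer row. With an index on (post_id, revision_number) each evaluation is a cheap index lookup, but actual performance depends on the database engine, indexing, and dataset size.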
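Whatever their performance, the two queries should return the same rows. The sketch below checks that on a tiny dataset. It assumes SQLite via Python's sqlite3 module rather than MySQL, and the helper names build_demo_db and fetch_ids, as well as the sample rows, are hypothetical. The rows are deliberately inserted out of revision order to show that the newest revision is not necessarily the row with the highest id.

```python
import sqlite3

JOIN_QUERY = """
    SELECT pr.id
    FROM post_revisions pr
    JOIN (
        SELECT post_id, MAX(revision_number) AS max_version
        FROM post_revisions
        GROUP BY post_id
    ) latest
      ON pr.post_id = latest.post_id
     AND pr.revision_number = latest.max_version
"""

CORRELATED_QUERY = """
    SELECT id
    FROM post_revisions pr
    WHERE revision_number = (
        SELECT MAX(revision_number)
        FROM post_revisions
        WHERE post_id = pr.post_id
    )
"""


def build_demo_db(rows):
    """Create an in-memory table (SQLite dialect, an assumption) and load rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE post_revisions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            post_id INT NOT NULL,
            revision_number INT NOT NULL,
            UNIQUE (post_id, revision_number)
        )
    """)
    conn.executemany(
        "INSERT INTO post_revisions (post_id, revision_number) VALUES (?, ?)",
        rows,
    )
    return conn


def fetch_ids(conn, query):
    """Run a query that selects a single id column and return the ids sorted."""
    return sorted(row[0] for row in conn.execute(query))


# Hypothetical rows: post 1 gets revision 2 before revision 1.
demo = build_demo_db([(1, 2), (1, 1), (2, 1), (2, 3), (2, 2)])
join_ids = fetch_ids(demo, JOIN_QUERY)
correlated_ids = fetch_ids(demo, CORRELATED_QUERY)
print(join_ids, correlated_ids)  # [1, 4] [1, 4]
```

Agreement on a toy dataset says nothing about speed. To compare performance, inspect the execution plan on your own database and data.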

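Alternative: Using a CTE (Common Table Expression) with a Window Function

You can also use a common table expression together with the ROW_NUMBER() window function to get the same result. This requires a database that supports both features, such as MySQL 8.0 or later:

WITH ranked AS (
    SELECT
        id,
        ROW_NUMBER() OVER (PARTITION BY post_id ORDER BY revision_number DESC) AS rn
    FROM post_revisions
)
SELECT id FROM ranked WHERE rn = 1;

Giving the CTE its own name (ranked) avoids shadowing the post_revisions table, and selecting only id keeps the output consistent with the other queries.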
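The window-function approach can be exercised the same way. This is a sketch under stated assumptions: it runs on SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer) instead of MySQL, the CTE is named ranked and the row-number column rn so that neither is confused with the base table or with post_id, and the helper names and sample rows are hypothetical.

```python
import sqlite3

WINDOW_QUERY = """
    WITH ranked AS (
        SELECT
            id,
            ROW_NUMBER() OVER (
                PARTITION BY post_id
                ORDER BY revision_number DESC
            ) AS rn
        FROM post_revisions
    )
    SELECT id FROM ranked WHERE rn = 1 ORDER BY id
"""


def create_revisions_table(rows):
    """Create an in-memory table (SQLite dialect, an assumption) and load rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE post_revisions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            post_id INT NOT NULL,
            revision_number INT NOT NULL,
            UNIQUE (post_id, revision_number)
        )
    """)
    conn.executemany(
        "INSERT INTO post_revisions (post_id, revision_number) VALUES (?, ?)",
        rows,
    )
    return conn


def latest_ids_window(conn):
    """Return the id of the row ranked first (highest revision) in each post."""
    return [row[0] for row in conn.execute(WINDOW_QUERY)]


# Hypothetical rows: post 10 has two revisions, post 20 has one.
window_conn = create_revisions_table([(10, 1), (10, 2), (20, 1)])
print(latest_ids_window(window_conn))  # [2, 3]
```

Within each post_id partition the rows are numbered from the highest revision_number downward, so filtering on the first row number keeps exactly one row per post.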

Performance Considerations

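These queries benefit from a composite index on (post_id, revision_number). In the table definition above, the UNIQUE KEY on (post_id, revision_number) already creates such an index, so no additional index is needed. If your table lacks that constraint, create the index explicitly:

CREATE INDEX idx_post_id_version ON post_revisions (post_id, revision_number);

This allows the database to find the highest revision_number for each post_id directly from the index instead of scanning the whole table.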
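Before adding an index, it is worth checking which ones already exist, since a unique constraint on (post_id, revision_number) is itself backed by an index. The sketch below shows that check. It assumes SQLite via Python's sqlite3 module, where the PRAGMA index_list and PRAGMA index_info statements expose this information; on MySQL the comparable statement is SHOW INDEX FROM post_revisions. The helper name unique_index_columns is hypothetical.

```python
import sqlite3


def unique_index_columns(conn, table):
    """Map each UNIQUE index on `table` to its ordered list of column names."""
    indexes = {}
    # PRAGMA index_list rows start with (seq, name, unique, ...).
    for row in conn.execute(f"PRAGMA index_list('{table}')").fetchall():
        name, is_unique = row[1], row[2]
        if is_unique:
            # PRAGMA index_info rows are (seqno, cid, column_name).
            info = conn.execute(f"PRAGMA index_info('{name}')").fetchall()
            indexes[name] = [col[2] for col in info]
    return indexes


index_conn = sqlite3.connect(":memory:")
index_conn.execute("""
    CREATE TABLE post_revisions (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        post_id INT NOT NULL,
        revision_number INT NOT NULL,
        UNIQUE (post_id, revision_number)
    )
""")

# The UNIQUE constraint alone yields an index on (post_id, revision_number).
for index_name, columns in unique_index_columns(index_conn, "post_revisions").items():
    print(index_name, columns)
```

If the composite index already appears in that listing, a second index on the same columns only adds write overhead.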

Conclusion

When retrieving the latest version of records in a SQL database, a JOIN against a grouped subquery is often a good default for both performance and readability. A correlated subquery also works well in simpler cases, and a window function is a clean option where the database supports it. Whichever you choose, a composite index on (post_id, revision_number) is what keeps these queries fast.

Do you have other approaches or optimizations? Feel free to share your thoughts!