TL;DR

Favour narrow, task-specific tables. Make much more use of 1:1 relationships between tables where shared primary key values also act as foreign keys.

This is about transactional/OLTP, not analytic/OLAP tables

It’s very important to understand that I’m talking about everyday transactional databases, not analytic databases. The data design concerns for the latter are very different.

All the things about all the things

Every project where I didn’t do the data design has had a handful of main tables that were anywhere from pretty wide to utterly ridiculous. The most egregious was a heavily-traficked crucial piece of 24/7/365 EDI infrastructure. Their table for tracking those transactions was so wide, they’d run into the column limit on SQL Server.

Every users table I’ve ever seen is an example of this.

It’s understandable, though! Central tables describe central concerns, and you have to track a lot of information about your central concerns. What is the alternative?

The thing about central pieces of your data design is that they’re just about impossible to change in practice. So it’s really important to get them right.

So let’s talk about what’s wrong with wide tables.

Rows are the fundamental chunks

The problem is that your database stores data in rows. Each row of each table is stored as a contiguous chunk on disk somewhere. Whenever you access anything from a row, the database has to retrieve the entire row from disk.

That’s it. All of the issues with wide tables arise from this simple fact.

Consider authenticating a user. You will need to retrieve the user’s password hash based on their username or email. Something like that. Maybe you also need a salt column. The point is, you need very little information to authenticate your user. When you do this using your wide table with every damn thing in the world about your user, you force the database to retrieve and cache a bunch of information you don’t need.

Those extraneous data have very real costs: the rows are kept in fixed-size pages. The bigger the rows in the pages, the fewer rows fit in a page. Those rows are tracked in data structures. The more data that has to be kept in those data structures, the fewer the rows that will fit in the nodes of that data structure, the larger and deeper they have to be, and consequently the slower they are.

Everywhere that data is used, the bloated rows just slow everything down.

Now consider when you’re doing different things with that user. Maybe you’re evaluating some aggregate data about your users to send them carefully-targeted marketing emails. Now, the login information in the users table is extraneous bloat.

Keep data together that you use together.

The simple solution is to group the columns that you use together in their own tables. Have a user_auth table for login and authentication. user_communications for email and social meda ids. user_stats for statistics, and so on.

Two points:

joins are expensive, but not all that expensive on today’s fast ssds. And any cost is usually more than made up for by more efficient cache usage and smaller rows; and
the rows of these tables stand in a 1:1 relation to each other. The way to represent that is to use the same primary key in all these tables.

The little-used 1:1 relationship

A primary key is a name. It designates a thing, whether an abstract thing within the data model or a real thing being modeled.

Names can be re-used. You won’t go to hell for it; I promise.

Here’s the magic thing that no-one ever seems to do: you can use the same primary key in different tables when those tables are about the same things. This is how you create multiple tables in a 1:1 relationship.

If you group your columns carefully, you won’t often have to, but you absolutely can join tables on their primary keys, if you ever need two chunks of information at the same time.

A primary key can also be a foreign key. In fact, you can create foreign key constraints between primary keys and it all works fine.

This sort of thing (Postgres):

create table
  user_logins(
    id bigint primary key generated always as identity,
    email text,
    password_hash text,
    salt text
);

create table
  user_communications(
    id bigint primary key references user_logins(id),
    twitter_id text,
    substack_id text
 );

  create table
    user_stats(
      id bigint primary key references user_logins(id),
      upvote_count int default 0,
      post_count int default 0
    );

It’s usually best to have one table you treat as primary, that must always exist, where you generate the shared primary key. It nearly always makes sense to use the login table for this. Then you have other tables reference the login table as in the example above.

Simplify your code

A lot of applications follow a Active Record pattern with a class that represents a table on disk. Note that if you follow this pattern, you will have smaller, simpler model classes.

If you think about it, these smaller tables and smaller model classes will be easier to understand and less duanting to change. Bonus!

Conclusion

That’s it! It’s a simple idea, it will make your databases faster and easier to deal with, and you should just do it. And even if you have some of these monstrous tables, you can start using the 1:1 table pattern in the future.