Girders Blog
Notes on building internet applications

Consolidating PostgreSQL Databases into Schemas

May 24, 2022

I had a legacy Multi-Tenant application where each customer account had data in its own PostgreSQL database and another database for common data and account management. I needed to fold all those databases into a single database, but maintain the data separation.

The solution was to use PostgreSQL schemas, which act like table namespaces, and to load each database into its own schema within a central database. This allowed us to move forward without a costly rewrite of the legacy application to a multi-customer table structure.

This approach is similar to “partitioned tables,” which use a set of sub-tables to store and access the data, separated by a key (month, id range, or hash). Instead of sub-tables, the schema approach partitions by customer, and allows a customer schema to be removed from the database without affecting other customer schemas.
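
For example, when a customer leaves, that customer's schema (a hypothetical cust99 below, in the consolidated appdb database described later) can be dropped without touching any other customer's data:

# Hypothetical example: remove one departed customer's schema and everything in it
psql -U app appdb -c "drop schema cust99 cascade;"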

Benefits of this approach are:

- Each customer's data stays separated in its own schema, without a costly rewrite of the legacy application to a multi-customer table structure.
- There is a single database to run and manage, instead of one per customer plus the common database.
- A customer schema can be dumped or removed on its own, without affecting other customer schemas.

Setting up the new Central Database

First, create an “app” user and an “appdb” database for the application. The database also needs the hstore extension. At the same time, I moved the data from an old server to a newer (more performant) one running the latest PostgreSQL version.

createuser -U postgres --createdb app
createdb -U app appdb
psql appdb app
appdb=# create extension if not exists hstore with schema public;

Restoring a PostgreSQL database into a different schema

The problem to work around is that you can’t restore a pg_dump file into a different schema directly. The data in each database is stored in the public schema. We also needed to minimize downtime.

One messy option is to rewrite the dump file, changing the “public” schema name to the customer's schema name. That requires dumping the database in plain text format, which isn’t practical for larger databases, and it can’t leverage multiple dump/restore jobs to reduce downtime.
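
For reference, a rough sketch of that rewrite (with illustrative names) might look something like this; it assumes none of the row data contains the literal text “public.”:

# Messy-option sketch only: rewrite the schema name in a plain-text dump stream.
# Fragile: any table data containing the text "public." would be rewritten too.
psql -U app appdb -c "create schema cust01;"
pg_dump -U postgres --host olddb --dbname=cust01 --schema=public --format=plain \
        | sed -e 's/public\./cust01./g' -e 's/ SCHEMA public/ SCHEMA cust01/g' \
        | psql -U app appdb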

My approach was to restore the database into the “public” schema, then rename that schema to the original name of the database. The next problem was that the extensions we need in the public schema would be moved along with the rename, so they had to be put back afterwards.

First, we run the dump with 8 concurrent jobs, which stores the backup in a directory format with a separate data file per table. This is fast and efficient, but the dump must complete before the restore into the new database can begin. I performed the dump/restore from the new database server.

pg_dump -U postgres --host olddb --dbname=cust01 --schema=public \
        --jobs 8 --format=directory --file=cust01.dump

The --jobs option spawns 8 processes, each opening its own connection to the original database and downloading table data concurrently. When the dump is complete, restore it locally into the public schema of appdb on the newdb server (localhost):

pg_restore -U postgres --host localhost --dbname=appdb \
        --jobs 8 --format=directory cust01.dump

Once restored, I reassigned ownership to the app user, renamed the schema to the original database name, recreated an empty public schema, and moved the hstore extension back into it:

psql appdb app
appdb=# reassign owned by cust01 to app;
appdb=# alter schema public rename to cust01;
appdb=# alter schema cust01 owner to app;
appdb=# create schema public;
appdb=# alter extension hstore set schema public;

I scripted these steps and ran them for each customer database and, finally, the common database. Because each restore went into the public schema, I could not run multiple restores concurrently. I could have run multiple dumps concurrently, though, as long as I didn’t saturate the bandwidth between the servers.
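
A minimal sketch of how that per-customer script might look (customer names, hosts, and exact privileges are assumptions, and error handling is omitted):

#!/usr/bin/env bash
# Sketch: dump each legacy database, restore it into appdb's public schema,
# then rename public to the customer name. Restores run serially because each
# one lands in the public schema before being renamed.
set -e
for cust in cust01 cust02 cust03; do
  pg_dump -U postgres --host olddb --dbname="$cust" --schema=public \
          --jobs 8 --format=directory --file="$cust.dump"
  pg_restore -U postgres --host localhost --dbname=appdb \
          --jobs 8 --format=directory "$cust.dump"
  psql -U postgres --host localhost appdb <<SQL
reassign owned by $cust to app;
alter schema public rename to $cust;
alter schema $cust owner to app;
create schema public;
alter extension hstore set schema public;
SQL
done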

An alternative approach would be to rename the schema on the old server first, then dump/restore it into the new database, though attention needs to be paid to extension locations. I don’t remember why this approach wasn’t ideal.
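
For illustration, a sketch of that alternative (assuming hstore is the only extension that needs to stay in the old server's public schema) might be:

# Alternative sketch: rename the schema on the old server, then dump/restore it.
# Note this modifies the old database in place.
psql -U postgres --host olddb cust01 <<SQL
alter schema public rename to cust01;
create schema public;
alter extension hstore set schema public;
SQL
pg_dump -U postgres --host olddb --dbname=cust01 --schema=cust01 \
        --jobs 8 --format=directory --file=cust01.dump
pg_restore -U postgres --host localhost --dbname=appdb \
        --jobs 8 --format=directory cust01.dump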

Accessing Schemas from psql

Now that the common and customer databases are in separate schemas in a single database, we have to reference table names with the schema name as a prefix:

appdb=# select * from common.customers;
appdb=# select * from cust01.inventory;

This works, but isn’t very friendly, and listing tables doesn’t work well either. PostgreSQL has a “search_path” setting, which acts much like your PATH environment variable does when it searches directories in a specific sequence to locate a command. The “search_path” value is a comma-separated list of schemas to search for tables. You can set it in your psql session:

appdb=# set search_path=cust01,common,public;
appdb=# select * from customers;  -- Finds table in common
appdb=# select * from inventory;  -- Finds table in cust01
appdb=# \d    -- You can now list all tables in your search_path!

I also scripted this so I can start a psql session already pointing to the proper search_path. My adb script takes the customer schema name and launches psql with a connection string that sets the search_path. It also sets up a custom prompt so I know which customer I’m working with.

#!/usr/bin/env bash
# Usage: adb customername
# Build a temporary .psqlrc that adds the customer name to the prompt.
[ -f "$HOME/.psqlrc.appdb" ] && rm "$HOME/.psqlrc.appdb"
[ -f "$HOME/.psqlrc" ] && cp "$HOME/.psqlrc" "$HOME/.psqlrc.appdb"
echo "\\set PROMPT1 '%m:%/:$1%R%# '" >> "$HOME/.psqlrc.appdb"
export PSQLRC="$HOME/.psqlrc.appdb"

# Connect with the customer schema first in the search_path; clean up afterwards.
psql "host=newhost user=app dbname=appdb options=--search_path=$1,common,public"
[ -f "$HOME/.psqlrc.appdb" ] && rm "$HOME/.psqlrc.appdb"

That gives me a prompt like this, showing me the db server (prod or dev), db name (again prod or dev), and the customer name.

newhost:appdb:cust01=#

Configuring your application

For PHP, you set your database connection string much like the psql connection string above:

host=newdb port=5432 dbname=appdb user=app options='--client_encoding=UTF8 --search_path=public'

The pg_pconnect() call will return an existing connection if the connection strings match, so don’t put the customer-specific search_path in the connection string (or leave search_path out entirely). Instead, perform the switch at the start of each transaction or request, usually based on the domain name, Authorization header, or another request parameter:

pg_query("SET search_path TO cust01,common,public;");

Your framework may have a particular way to do this.

For Rails, you can set the schema_search_path in your config/database.yml:

development:
  <<: *default
  database: appdev
  schema_search_path: cust01,common,public
  host: localhost

But you must still switch to the proper customer schema at the start of each transaction or request (ApplicationController, Jobs, etc.):

ActiveRecord::Base.connection.execute("SET search_path to #{customer.dbname},common,public")