Production Considerations

When it comes time to deploy your PostGraphile application in production, there are a few things you'll want to think about, including logging, security, and stability. This article outlines some of the issues you might face and how to solve them.

Database Access Considerations

PostGraphile is just a Node app / middleware, so you can deploy it to any number of places: Heroku, Now.sh, a VM, a container (e.g. Docker), or, of course, bare metal. Typically you won't run PostGraphile on the same hardware/container/VM as the database, so PostGraphile needs to be able to connect to your database without you putting your DB at risk.

A standard way of doing this is to put the DB behind a firewall. However, if you're using a system like Heroku or Now.sh you probably can't do that, so instead you must make your DB accessible to the internet. When doing so, here are a few things we recommend (see the connection sketch after this list):

  1. Only allow connections over SSL (the force_ssl setting).
  2. Use a hard-to-guess username (not root, admin, postgres, etc., which are all commonly used).
  3. Use a very strong password; you can generate one with a command like: openssl rand -base64 30 | tr '+/' '-_'
  4. Use a non-standard port for your PostgreSQL server if you can (pick a random port number).
  5. Use a hard-to-guess hostname, and never reveal the hostname to anyone who doesn't need to know it.
  6. If possible, limit the IP addresses that can connect to your DB to those of your hosting provider.
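
If you're using PostGraphile as a library with the pg module, a minimal sketch of what the client side of these recommendations might look like follows; the SSL options shown are standard node-postgres settings rather than anything PostGraphile-specific, and DATABASE_URL is assumed to contain your full connection string:

const express = require("express");
const { Pool } = require("pg");
const { postgraphile } = require("postgraphile");

const app = express();

// Assumes DATABASE_URL embeds the hard-to-guess hostname, username,
// strong password and non-standard port recommended above
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  // Require TLS and verify the server's certificate chain
  ssl: { rejectUnauthorized: true },
});

app.use(postgraphile(pool));

app.listen(process.env.PORT || 3000);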

Heroku has instructions on making RDS available for use from Heroku, which should also work for Now.sh or any other service: https://devcenter.heroku.com/articles/amazon-rds

Common Middleware Considerations

In a production app, you typically want to add a few common enhancements, for example:

  • Logging
  • Gzip or similar compression
  • Security protections
  • Rate limiting

Since there are already many options and opinions in this space, and they're not directly related to the problem of serving GraphQL from your PostgreSQL database, PostGraphile does not include them by default. We recommend using Express middleware (or similar) to implement these common requirements; this is one reason we recommend using PostGraphile as a library in production.

Picking the Express (or similar) middleware that works for you is beyond the scope of this article; below is an example of where to mount such middleware.

const express = require("express");
const { postgraphile } = require("postgraphile");

const app = express();

/* Example middleware you might want to put in front of PostGraphile */
// app.use(require('morgan')(...));
// app.use(require('compression')({...}));
// app.use(require('helmet')({...}));

app.use(postgraphile(process.env.DATABASE_URL || "postgres:///"));

app.listen(process.env.PORT || 3000);

Denial of Service Considerations

When you run PostGraphile in production you'll want to ensure that people cannot easily trigger denial-of-service (DoS) attacks against you. Due to the nature of GraphQL, it's easy to construct a small query that is very expensive for the server to run, for example:

{
  allUsers {
    nodes {
      postsByAuthorId {
        nodes {
          commentsByPostId {
            nodes {
              userByAuthorId {
                postsByAuthorId {
                  nodes {
                    commentsByPostId {
                      nodes {
                        userByAuthorId {
                          postsByAuthorId {
                            nodes {
                              commentsByPostId {
                                nodes {
                                  userByAuthorId {
                                    id
                                  }
                                }
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

There are many techniques for protecting your server from these kinds of queries; a great introduction to the subject is this blog post from Apollo.

These techniques should be used in conjunction with common HTTP protection methods, such as rate limiting, which are typically better implemented at a separate layer; for example, you could use Cloudflare rate limiting for this.
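
If you'd rather add rate limiting inside your Node app as well, here's a minimal sketch using the community express-rate-limit package (the window and request-count values are placeholders to tune for your own traffic):

const express = require("express");
const rateLimit = require("express-rate-limit");
const { postgraphile } = require("postgraphile");

const app = express();

// Placeholder policy: at most 100 requests per IP per 15-minute window
app.use(
  rateLimit({
    windowMs: 15 * 60 * 1000,
    max: 100,
  })
);

app.use(postgraphile(process.env.DATABASE_URL || "postgres:///"));

app.listen(process.env.PORT || 3000);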

Simple: Query Whitelist ("persisted queries")

If you do not intend to open your API up to third parties to run arbitrary queries, then using persisted queries as a query whitelist is a good way to protect your GraphQL endpoint. This technique ensures that only the queries you use in your own applications can be executed on the server (though you can, of course, change the variables).

This technique has a few caveats:

  • Your API will only accept queries that you've approved, so it's not suitable if you want third parties to run arbitrary queries.
  • You must be able to generate a unique ID for each query, e.g. a hash (see the sketch after this list).
  • You must use "static" GraphQL queries - that is, the queries must be known at build time of your application/webpage, and only the variables fed to those queries can change at run-time.
  • You must have a way of sharing these queries between the application and the server.
  • You must be careful not to use variables in dangerous places; for example, don't write allUsers(first: $myVar), as a malicious attacker could set $myVar to 2147483647 to make your server process as much data as possible.
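
For the unique ID caveat, a stable hash of the query text works well; here's a sketch using Node's built-in crypto module (the function name is ours, not part of any library):

const crypto = require("crypto");

// The same query text always yields the same ID, across builds and apps
function queryId(queryText) {
  return crypto.createHash("sha256").update(queryText).digest("hex");
}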

PostGraphile doesn't currently have this functionality built in, but it's fairly easy to add when using PostGraphile as an Express middleware; a simple implementation might look like this:

const { postgraphile } = require('postgraphile');
const express = require('express');
const bodyParser = require('body-parser');

const app = express();
app.use(bodyParser.json());

/**** BEGINNING OF CUSTOMIZATION ****/
const persistedQueries = require('./persistedQueries.json');

app.use('/graphql', (req, res, next) => {
  // TODO: validate req.body is of the right form
  // Swap the client-supplied ID for the whitelisted query text;
  // unknown IDs result in a null query, which will be rejected
  req.body.query =
    {}.hasOwnProperty.call(persistedQueries, req.body.id)
      ? persistedQueries[req.body.id]
      : null;
  next();
});
/**** END OF CUSTOMIZATION ****/

app.use(postgraphile(process.env.DATABASE_URL || 'postgres:///'));

app.listen(5000);

That is, a simple middleware mounted in front of PostGraphile that swaps the query ID for the query text in the request body.
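
On the client side, the counterpart (a hypothetical sketch; the exact body shape is whatever your middleware expects) sends the query's ID and variables instead of the query text:

// Send the persisted query's ID (e.g. the hash generated at build time)
// plus the variables; the server looks up the actual query text
fetch("/graphql", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    id: queryId, // hypothetical: the build-time hash for this query
    variables: { userId: 1 },
  }),
})
  .then((res) => res.json())
  .then((result) => console.log(result));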

I personally use my forks of Apollo's persistgraphql tools to help me manage the persisted queries themselves.

These forks generate hashes rather than numbers, which makes the persisted queries consistent across multiple builds and applications (website, mobile, browser plugin, ...).

NOTE: even if you're using persisted queries, it can be wise to implement the advanced protections below too, as they enable you to catch unnecessarily expensive queries before they become performance bottlenecks down the line.

Advanced

Using a query whitelist puts the decision of whether a particular query should be accepted in the hands of your engineers. Sometimes this isn't enough - it could be that your engineers need guidance to help them avoid common pitfalls (e.g. forgetting to put limits on the collections they query), or it could be that you want arbitrary third parties to be able to send queries to your API without pre-approval and without the risk of bringing your servers to their knees.

You are highly encouraged to purchase the Pro Plugin, which implements these protections in a deeply integrated and PostGraphile-optimised way, and has the added benefit of helping sustain development and maintenance of the project.

The following details how the Pro Plugin addresses these issues, including hints on how you might solve them yourself. Many of these techniques can be implemented outside of PostGraphile, for example in an Express middleware or an nginx reverse proxy between PostGraphile and the client.

Sending queries to read replicas

Probably the most important thing for scalability is making sure that your master database doesn't buckle under the pressure of all the clients talking to it. One way to reduce this pressure is to offload read operations to read replicas. In GraphQL it's easy to tell whether a request may perform writes: if it's a query, it's read-only; if it's a mutation, it may perform writes.
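
To illustrate (this is one way you could make that determination yourself, not PostGraphile's internal code), you can inspect the operations in the parsed document:

const { parse } = require("graphql");

// A GraphQL document is read-only if none of its operations is a mutation
function isReadOnly(queryText) {
  return parse(queryText).definitions.every(
    (def) => def.kind !== "OperationDefinition" || def.operation !== "mutation"
  );
}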

Using --read-only-connection <string> you may give PostGraphile a separate connection string to use for queries, to complement the connection string passed via --connection, which will now be used only for mutations.

(If you're using PostGraphile as a library, you'll want to pass a read-only pool via the readReplicaPgPool option instead.)
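
For example, with the library form (READ_REPLICA_DATABASE_URL is just a placeholder name for wherever you keep your replica's connection string):

const express = require("express");
const { Pool } = require("pg");
const { postgraphile } = require("postgraphile");

const app = express();

// Primary pool: used for mutations (writes)
const rootPgPool = new Pool({ connectionString: process.env.DATABASE_URL });
// Replica pool: used for queries (reads)
const replicaPgPool = new Pool({
  connectionString: process.env.READ_REPLICA_DATABASE_URL,
});

app.use(
  postgraphile(rootPgPool, "public", {
    readReplicaPgPool: replicaPgPool, // Pro Plugin option
  })
);

app.listen(process.env.PORT || 3000);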

NOTE: We don't currently support the multi-host syntax for this connection string, but you can use a PostgreSQL proxy such as PgPool or PgBouncer between PostGraphile and your database to enable connecting to multiple read replicas.

Pagination caps

It's unlikely that you want users to request allUsers and receive back literally every user in the database. More likely you want users to use cursor-based pagination over this connection with first / after. The Pro Plugin introduces the --default-pagination-cap [int] option, which enforces a pagination cap on all connections: whatever number you pass will be used as the cap, but you can override it on a table-by-table basis using smart comments - in this case, the @paginationCap smart comment.

comment on table users is
  E'@paginationCap 20\nSomeone who can log in.';

Limiting GraphQL query depth

Most GraphQL queries tend to be only a few levels deep; queries like the deep one earlier in this article are generally not required. You may use --graphql-depth-limit [int] to limit the depth of the GraphQL queries that hit PostGraphile - anything deeper will be rejected during query validation.
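
If you're rolling your own outside the Pro Plugin, the community graphql-depth-limit package provides a standard graphql-js validation rule; here's a sketch of applying it directly (the depth of 8 and the function name are placeholders):

const { parse, validate } = require("graphql");
const depthLimit = require("graphql-depth-limit");

// Throws if the document is nested more than 8 levels deep
function assertDepthOk(schema, queryText) {
  const errors = validate(schema, parse(queryText), [depthLimit(8)]);
  if (errors.length > 0) {
    throw errors[0];
  }
}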

[EXPERIMENTAL] GraphQL cost limit

The most powerful way of preventing DoS is to limit the cost of the GraphQL queries that may be executed against your server. The Pro Plugin contains a very early implementation of this technique (the cost estimates are not very accurate yet). You may enable a cost limit with --graphql-cost-limit [int], and the calculated cost of each GraphQL query will be made available on the meta field of the GraphQL payload.

If your GraphQL query is deemed too expensive, here are some techniques to bring the calculated cost down (see the bounded query sketch after this list):

  • If you've not specified a limit (first / last) on a connection, we assume it will return 1000 results; if you're expecting fewer, specify the maximum you'd ever expect to receive.
  • Cost is based on the number of expected results (we don't look at the database!), so lower the limits on your connections.
  • Connections multiply the cost of their children by the number of results they're expected to return, so limits on parent connections matter even more.
  • Nested fields multiply costs, so pulling a connection inside a connection inside a connection gets expensive - to address this, use lower first / last values on the connections, or avoid fetching nested data until you need to display it (split into multiple requests / only request the data you need).
  • Subscriptions are automatically treated as 10x as expensive as queries - try to minimise the amount of data your subscription requests.
  • Procedure connections are treated as more expensive than table connections.
  • totalCount on a table has a fair cost.
  • totalCount on a procedure has a higher cost.
  • Using pageInfo adds significant cost to connections.
  • Computed columns are seen as fairly expensive - in future we may factor in PostgreSQL's COST parameter when estimating this.
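
For example, a bounded version of the nested query from earlier (field names assume the same example schema) keeps the calculated cost down by capping every connection:

{
  allUsers(first: 10) {
    nodes {
      postsByAuthorId(first: 5) {
        nodes {
          commentsByPostId(first: 5) {
            nodes {
              id
            }
          }
        }
      }
    }
  }
}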

Keep in mind this is a very early implementation of cost analysis; there's much improvement to be made. Feel free to reach out with any queries whose calculated costs seem wrong so we can improve it.