There are soooo many transactions and if I create a query that scans the whole table, it almost never finishes. Here are the steps I took to solve this with help from Ema.
Describe what I want to query accurately
If we know exactly what we want, we will notice the unnecessary scanning/joining/N+1 problem.
To put today’s goal clear:
I want to find out all the outlets that have been transacting using a specific payment type.
Start with a “native” approach
Here is the “native” implementation I did:
1 | select outlet.id as outlet_id, outlet.retailer_id, |
For the transaction.id >= '1587232952939446272'
see another note on using flake ID to retrieve creation time of the ID.
Takes forever.
explain
: see how the query will be executed
1 | explain select outlet.id as outlet_id, outlet.retailer_id, |
Explain output:
1 | Seq Scan on vend_outlet outlet (cost=0.00..23108866749.20 rows=179496 width=75) |
This query goes through outlets and find out if a transaction exists in the table. This is the classic N+1 problem:
The N + 1 problem occurs when an application gets data from the database, and then loops through the result of that data. That means we call to the database again and again and again. In total, the application will call the database once for every row returned by the first query (N) plus the original query ( + 1).
Is the query following any index?
1 | SELECT |
Output:
transaction
idx_16548_primary
CREATE UNIQUE INDEX idx_16548_primary ON payment.transaction USING btree (id)
transaction
idx_16548_register_sale_payment_id
CREATE UNIQUE INDEX idx_16548_register_sale_payment_id ON payment.transaction USING btree (register_sale_payment_id)
transaction
idx_16548_retailer_id_idx
CREATE INDEX idx_16548_retailer_id_idx ON payment.transaction USING btree (retailer_id)
To speed up the query, find opportunities to retrieve data by index.
For this query, at least we can add an additional transaction.retailer_id=outlet.retailer_id
to speed up the querying of transaction.
Can looped query be replaced with a smart join?
All I need is a list of outlet ids - Ema tipped that a distinct
will help here too
1 | SELECT distinct register.outlet_id |
Thoughts
- Read on: https://www.sqlshack.com/sql-index-overview-and-strategy/
- Be aware of N+1 problems, try to avoid by using a
join
instead. - Revert the query to scan the big table and find distinct elements, instead of querying the big table for
EXISTS