There are soooo many transactions that a query scanning the whole table almost never finishes. Here are the steps I took to solve this, with help from Ema.

Describe what I want to query accurately

If we know exactly what we want, we can spot unnecessary scanning, joining, and N+1 problems.

To state today’s goal clearly:

I want to find all the outlets that have been transacting with a specific payment type.

Start with a “naive” approach

Here is the “naive” implementation I wrote:

select outlet.id as outlet_id, outlet.retailer_id,
EXISTS(select 1 from payment.transaction transaction
    where transaction.id >= '1587232952939446272'
    and transaction.retailer_id = outlet.retailer_id
    and transaction.payment_type_id = 167
    and transaction.register_id in (
        select register.id from vend.vend_register register where register.outlet_id = outlet.id))
from vend.vend_outlet outlet
where outlet.deleted_at is NULL and outlet.source_deleted_at is NULL

For the transaction.id >= '1587232952939446272' filter, see another note on using flake IDs to recover the creation time encoded in an ID.
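As a sketch of why an ID can act as a time filter: assuming these are Twitter-style snowflake IDs (millisecond timestamp shifted left 22 bits, epoch 1288834974657 — an assumption, adjust for your generator), Postgres can decode the creation time directly:

-- hypothetical decode, assuming the Twitter snowflake layout and epoch
select to_timestamp(((1587232952939446272 >> 22) + 1288834974657) / 1000.0);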

Takes forever.

explain: see how the query will be executed

explain select outlet.id as outlet_id, outlet.retailer_id,
EXISTS(select 1 from payment.transaction transaction
    where transaction.id >= '1587232952939446272'
    and transaction.retailer_id = outlet.retailer_id
    and transaction.payment_type_id = 167
    and transaction.register_id in (
        select register.id from vend.vend_register register where register.outlet_id = outlet.id))
from vend.vend_outlet outlet
where outlet.deleted_at is NULL and outlet.source_deleted_at is NULL

Explain output:

Seq Scan on vend_outlet outlet  (cost=0.00..23108866749.20 rows=179496 width=75)
  Filter: ((deleted_at IS NULL) AND (source_deleted_at IS NULL))
  SubPlan 1
    ->  Nested Loop Semi Join  (cost=423.08..128742.99 rows=1 width=0)
          ->  Bitmap Heap Scan on transaction  (cost=422.53..127818.52 rows=107 width=37)
                Recheck Cond: (retailer_id = outlet.retailer_id)
                Filter: ((id >= '1587232952939446272'::text) AND (payment_type_id = 167))
                ->  Bitmap Index Scan on idx_16548_retailer_id_idx  (cost=0.00..422.50 rows=33858 width=0)
                      Index Cond: (retailer_id = outlet.retailer_id)
          ->  Index Scan using vend_register_pkey on vend_register register  (cost=0.55..8.56 rows=1 width=37)
                Index Cond: (id = transaction.register_id)
                Filter: (outlet_id = outlet.id)

This query goes through every outlet and, for each one, runs the subplan to check whether a matching transaction exists. The plan shows why that is slow: the subplan costs about 128,743 per probe and runs once for each of the ~179,496 outlet rows, which multiplies out to the total cost of roughly 23,108,866,749. This is the classic N+1 problem:

The N+1 problem occurs when an application gets data from the database and then loops through the result. That means we call the database again and again: in total, the application calls the database once for every row returned by the first query (N), plus the original query (+1).
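In plan terms, the EXISTS subplan behaves like running one correlated probe per outlet row, roughly like this (:outlet_id and :retailer_id stand in for the outer row’s values):

-- conceptual sketch: what the subplan effectively runs for each outlet row
select 1 from payment.transaction transaction
where transaction.id >= '1587232952939446272'
  and transaction.retailer_id = :retailer_id    -- correlated with the outer outlet row
  and transaction.payment_type_id = 167
  and transaction.register_id in (
      select register.id from vend.vend_register register
      where register.outlet_id = :outlet_id     -- correlated with the outer outlet row
  )
limit 1;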

Is the query using any index?

SELECT
tablename,
indexname,
indexdef
FROM
pg_indexes
WHERE
schemaname = 'payment'
and tablename = 'transaction'
ORDER BY
tablename,
indexname;

Output:

 tablename   | indexname                          | indexdef
-------------+------------------------------------+----------
 transaction | idx_16548_primary                  | CREATE UNIQUE INDEX idx_16548_primary ON payment.transaction USING btree (id)
 transaction | idx_16548_register_sale_payment_id | CREATE UNIQUE INDEX idx_16548_register_sale_payment_id ON payment.transaction USING btree (register_sale_payment_id)
 transaction | idx_16548_retailer_id_idx          | CREATE INDEX idx_16548_retailer_id_idx ON payment.transaction USING btree (retailer_id)
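As a side note, the same index information also shows up in psql via \d payment.transaction.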

To speed up the query, find opportunities to retrieve data by index.

For this query, the extra condition transaction.retailer_id = outlet.retailer_id is what gives the planner an indexed path into transaction (idx_16548_retailer_id_idx); without it, each probe would have to scan far more of the table.
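Going further, a composite index could serve the whole filter at once. A hypothetical sketch (this index does not exist on the table, and its write cost would need weighing before adding anything like it):

-- hypothetical composite index covering the retailer + payment type + id-range filter
create index idx_transaction_retailer_payment_id
    on payment.transaction (retailer_id, payment_type_id, id);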

Can the looped query be replaced with a smart join?

All I need is a list of outlet IDs, and Ema tipped that a DISTINCT will help here too:

SELECT distinct register.outlet_id
FROM payment.transaction transaction
LEFT JOIN vend.vend_register register on register.id=transaction.register_id
WHERE transaction.id >= '1575998929105846272' AND transaction.payment_type_id = 167
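And since the goal was a list of outlets, the distinct IDs can be mapped back to live outlet rows. A sketch (using an inner join so transactions without a matching register don’t produce a NULL outlet_id):

-- sketch: resolve the distinct outlet ids back to non-deleted outlet rows
SELECT outlet.id AS outlet_id, outlet.retailer_id
FROM vend.vend_outlet outlet
WHERE outlet.deleted_at IS NULL
  AND outlet.source_deleted_at IS NULL
  AND outlet.id IN (
      SELECT register.outlet_id
      FROM payment.transaction transaction
      JOIN vend.vend_register register ON register.id = transaction.register_id
      WHERE transaction.id >= '1575998929105846272'
        AND transaction.payment_type_id = 167
  );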

Thoughts

  1. Read on: https://www.sqlshack.com/sql-index-overview-and-strategy/
  2. Be aware of N+1 problems; try to avoid them by using a join instead.
  3. Invert the query: scan the big table once and collect distinct values, instead of probing the big table with EXISTS for every row.