Harbinger Explorer · Engineering
SQL Anti-Patterns: Common Mistakes and How to Fix Them

9 min read · Tags: sql, data-engineering, sql-anti-patterns, query-optimization, data-quality, best-practices

A query that works is not the same as a query that's correct, maintainable, or fast. SQL anti-patterns are the recurring mistakes that slow down pipelines, produce wrong results, and make future engineers want to delete your work. Most of them are easy to avoid once you know what to look for.

1. SELECT * in Production Pipelines

SELECT * feels convenient. In a pipeline, it's a ticking time bomb.

The problem: When a source table adds a column, your downstream SELECT * picks it up automatically — potentially breaking transformations that expect a fixed schema, bloating storage, or passing sensitive columns downstream without intent.

-- ❌ Anti-pattern (PostgreSQL)
CREATE TABLE fct_orders AS
SELECT *
FROM raw.orders
JOIN raw.customers USING (customer_id);

-- ✅ Fix: explicit column list
CREATE TABLE fct_orders AS
SELECT
    o.order_id,
    o.customer_id,
    o.order_total,
    o.placed_at,
    c.email,
    c.country
FROM raw.orders o
JOIN raw.customers c USING (customer_id);

Exception: SELECT * is fine in ad-hoc exploration. Never in a dbt model, a view, or a scheduled pipeline.
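The schema-drift hazard is easy to reproduce. Here is a minimal sketch using Python's built-in sqlite3 as a stand-in engine (table and column names are invented for illustration): after the source table grows a sensitive column, the next SELECT * rebuild silently carries it downstream.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER, order_total REAL);
    INSERT INTO raw_orders VALUES (1, 99.50);
""")

# First build: SELECT * picks up both columns.
conn.execute("CREATE TABLE fct_orders AS SELECT * FROM raw_orders")
cols_before = [row[1] for row in conn.execute("PRAGMA table_info(fct_orders)")]

# The source table grows a sensitive column...
conn.execute("ALTER TABLE raw_orders ADD COLUMN card_number TEXT")

# ...and the next SELECT * rebuild inherits it without anyone opting in.
conn.execute("DROP TABLE fct_orders")
conn.execute("CREATE TABLE fct_orders AS SELECT * FROM raw_orders")
cols_after = [row[1] for row in conn.execute("PRAGMA table_info(fct_orders)")]

print(cols_before)  # ['order_id', 'order_total']
print(cols_after)   # ['order_id', 'order_total', 'card_number']
```

An explicit column list in the CTAS would have kept `cols_after` identical to `cols_before` no matter what the source table grows.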

2. Correlated Subqueries Instead of JOINs

A correlated subquery runs once per row in the outer query. On a table with 10 million rows, that's 10 million subquery executions.

-- ❌ Anti-pattern (PostgreSQL) — correlated subquery, O(n) subquery executions
SELECT
    customer_id,
    total_spend,
    (SELECT AVG(total_spend) FROM orders) AS avg_spend, -- OK, uncorrelated
    (SELECT MAX(placed_at)                              -- CORRELATED — runs per row
     FROM orders o2
     WHERE o2.customer_id = o1.customer_id) AS last_order_date
FROM orders o1;

-- ✅ Fix: window function
SELECT
    customer_id,
    total_spend,
    AVG(total_spend) OVER () AS avg_spend,
    MAX(placed_at) OVER (PARTITION BY customer_id) AS last_order_date
FROM orders;

When correlated subqueries are acceptable: EXISTS checks where early termination matters, or LATERAL joins in PostgreSQL for complex per-row logic. Measure first.
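You can check that the window-function rewrite is an exact drop-in replacement. This sketch uses Python's stdlib sqlite3 (window functions need SQLite 3.25+, bundled with all recent Pythons) with made-up sample rows; the equivalence holds the same way in PostgreSQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, total_spend REAL, placed_at TEXT);
    INSERT INTO orders VALUES
        (1, 50.0, '2026-01-05'),
        (1, 70.0, '2026-02-01'),
        (2, 20.0, '2026-01-20');
""")

# Correlated form: the inner query conceptually re-runs for every outer row.
correlated = conn.execute("""
    SELECT customer_id, placed_at,
           (SELECT MAX(placed_at) FROM orders o2
            WHERE o2.customer_id = o1.customer_id) AS last_order_date
    FROM orders o1
    ORDER BY customer_id, placed_at
""").fetchall()

# Window form: one pass over the table, partitioned by customer.
windowed = conn.execute("""
    SELECT customer_id, placed_at,
           MAX(placed_at) OVER (PARTITION BY customer_id) AS last_order_date
    FROM orders
    ORDER BY customer_id, placed_at
""").fetchall()

print(correlated == windowed)  # True
```

Same rows out, but the window version does not multiply subquery executions by the outer row count.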

3. Implicit Type Conversions

When you compare a string column to an integer literal, the database silently converts one type to the other. This often disables index usage and can produce wrong results with NULL edge cases.

-- ❌ Anti-pattern (PostgreSQL) — implicit conversion, index on status_code not used
SELECT * FROM orders WHERE status_code = 1;
-- status_code is VARCHAR — DB converts 1 to '1' and scans the table

-- ✅ Fix: match types explicitly
SELECT * FROM orders WHERE status_code = '1';

-- ❌ Anti-pattern — date as string, wrong results in some dialects
SELECT * FROM events WHERE event_date > '2026-01-01';
-- Relies on an implicit string-to-date cast; behavior varies by dialect and configuration

-- ✅ Fix: explicit cast
SELECT * FROM events WHERE event_date > DATE '2026-01-01';  -- PostgreSQL/BigQuery
-- or
SELECT * FROM events WHERE event_date > CAST('2026-01-01' AS DATE);  -- universal

In Spark SQL, implicit type coercion is especially dangerous: unless ANSI mode is enabled, Spark will try to cast between types and produce NULL instead of an error when the cast fails.

4. NOT IN with NULLs

NOT IN with a subquery that can return NULL is one of the most common sources of silent data loss in SQL. The result is always an empty set when NULLs are present.

-- ❌ Anti-pattern (PostgreSQL) — returns 0 rows if cancelled_orders contains any NULL
SELECT order_id
FROM orders
WHERE order_id NOT IN (SELECT order_id FROM cancelled_orders);

-- If cancelled_orders has even one NULL order_id, this returns nothing.
-- SQL logic: NOT IN expands to order_id != val1 AND order_id != val2 ...
-- NULL comparison is UNKNOWN, so the whole expression is UNKNOWN (not TRUE).

-- ✅ Fix: use NOT EXISTS
SELECT o.order_id
FROM orders o
WHERE NOT EXISTS (
    SELECT 1 FROM cancelled_orders co WHERE co.order_id = o.order_id
);

-- ✅ Alternative: LEFT JOIN / IS NULL pattern
SELECT o.order_id
FROM orders o
LEFT JOIN cancelled_orders co ON o.order_id = co.order_id
WHERE co.order_id IS NULL;

Always use NOT EXISTS or a LEFT JOIN ... IS NULL when excluding rows based on a subquery that might contain NULLs.
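The trap takes five lines to demonstrate. This sketch runs on Python's stdlib sqlite3 with invented sample data, but the three-valued-logic semantics are standard SQL, so the result is the same in PostgreSQL, BigQuery, or Spark:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER);
    INSERT INTO orders VALUES (1), (2), (3);
    CREATE TABLE cancelled_orders (order_id INTEGER);
    INSERT INTO cancelled_orders VALUES (2), (NULL);  -- one NULL poisons NOT IN
""")

not_in = conn.execute("""
    SELECT order_id FROM orders
    WHERE order_id NOT IN (SELECT order_id FROM cancelled_orders)
""").fetchall()

not_exists = conn.execute("""
    SELECT o.order_id FROM orders o
    WHERE NOT EXISTS (
        SELECT 1 FROM cancelled_orders co WHERE co.order_id = o.order_id
    )
    ORDER BY o.order_id
""").fetchall()

print(not_in)      # [] — every row silently dropped
print(not_exists)  # [(1,), (3,)]
```

Orders 1 and 3 should survive the exclusion; NOT IN returns nothing at all, and no error tells you why.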

5. Joining Before Aggregating

Joining large tables and then aggregating forces the engine to move far more rows than aggregating first and then joining. The two orders are not automatically interchangeable, though: verify they produce the same result before swapping them.

-- ❌ Anti-pattern (Spark SQL) — join full tables, then aggregate (huge shuffle)
SELECT
    c.country,
    SUM(o.total_cents) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.country;

-- ✅ Fix: aggregate orders first, then join (smaller table to join)
WITH order_agg AS (
    SELECT customer_id, SUM(total_cents) AS total_revenue
    FROM orders
    GROUP BY customer_id
)
SELECT
    c.country,
    SUM(oa.total_revenue) AS revenue
FROM order_agg oa
JOIN customers c ON oa.customer_id = c.customer_id
GROUP BY c.country;

The fix reduces the row count before the join. In Spark, this dramatically reduces shuffle volume. In a standard RDBMS, the query planner often does this automatically — but not always. Don't assume.
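For this particular query the two orders are equivalent, and that is worth asserting rather than assuming. A sketch with Python's stdlib sqlite3 and made-up rows (the same CTE runs unchanged in Spark SQL or PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, total_cents INTEGER);
    INSERT INTO orders VALUES (1, 100), (1, 250), (2, 400);
    CREATE TABLE customers (customer_id INTEGER, country TEXT);
    INSERT INTO customers VALUES (1, 'DE'), (2, 'FR');
""")

# Join full tables, then aggregate.
join_then_agg = conn.execute("""
    SELECT c.country, SUM(o.total_cents)
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.country ORDER BY c.country
""").fetchall()

# Aggregate per customer first, then join the much smaller result.
agg_then_join = conn.execute("""
    WITH order_agg AS (
        SELECT customer_id, SUM(total_cents) AS total_revenue
        FROM orders GROUP BY customer_id
    )
    SELECT c.country, SUM(oa.total_revenue)
    FROM order_agg oa JOIN customers c ON oa.customer_id = c.customer_id
    GROUP BY c.country ORDER BY c.country
""").fetchall()

print(join_then_agg == agg_then_join)  # True
print(agg_then_join)                   # [('DE', 350), ('FR', 400)]
```

A comparison like this makes a good one-off regression check before you commit the pre-aggregated rewrite.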

6. Using HAVING Instead of WHERE for Non-Aggregate Filters

WHERE filters rows before aggregation. HAVING filters after. Using HAVING for non-aggregate predicates forces the database to aggregate all rows and then throw most of them away.

-- ❌ Anti-pattern (PostgreSQL) — country filter applied AFTER aggregation
SELECT country, COUNT(*) AS customer_count
FROM customers
GROUP BY country
HAVING country = 'DE';  -- aggregates all countries, then filters to DE

-- ✅ Fix: filter with WHERE before aggregation
SELECT country, COUNT(*) AS customer_count
FROM customers
WHERE country = 'DE'
GROUP BY country;

HAVING is correct when the predicate involves an aggregate (HAVING COUNT(*) > 100). For plain column filters, always use WHERE.
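Both forms return identical rows, which is exactly why the anti-pattern survives code review; only the amount of work differs. A quick equivalence check with Python's stdlib sqlite3 and invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, country TEXT);
    INSERT INTO customers VALUES (1, 'DE'), (2, 'DE'), (3, 'FR');
""")

# HAVING: groups every country, then discards all but DE.
having_q = conn.execute("""
    SELECT country, COUNT(*) FROM customers
    GROUP BY country HAVING country = 'DE'
""").fetchall()

# WHERE: prunes to DE rows before any grouping happens.
where_q = conn.execute("""
    SELECT country, COUNT(*) FROM customers
    WHERE country = 'DE' GROUP BY country
""").fetchall()

print(having_q == where_q)  # True — same rows, less work in the WHERE form
```

On three rows the difference is invisible; on a billion-row table, WHERE lets the engine skip most of the aggregation entirely.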

7. Overusing Subqueries Instead of CTEs

Deep subquery nesting is unreadable and unmaintainable. All modern SQL engines (PostgreSQL, BigQuery, Snowflake, DuckDB, Spark SQL) support CTEs: use them.

-- ❌ Anti-pattern — three levels of subquery nesting
SELECT *
FROM (
    SELECT customer_id, AVG(order_total) AS avg_order
    FROM (
        SELECT customer_id, order_id, SUM(line_total) AS order_total
        FROM (
            SELECT * FROM order_lines WHERE is_cancelled = FALSE
        ) active_lines
        GROUP BY customer_id, order_id
    ) order_totals
    GROUP BY customer_id
) customer_avgs
WHERE avg_order > 100;

-- ✅ Fix: CTEs for readability
-- DuckDB SQL
WITH active_lines AS (
    SELECT customer_id, order_id, line_total
    FROM order_lines
    WHERE is_cancelled = FALSE
),
order_totals AS (
    SELECT customer_id, order_id, SUM(line_total) AS order_total
    FROM active_lines
    GROUP BY customer_id, order_id
),
customer_avgs AS (
    SELECT customer_id, AVG(order_total) AS avg_order
    FROM order_totals
    GROUP BY customer_id
)
SELECT *
FROM customer_avgs
WHERE avg_order > 100;

CTEs are not just style. They're easier to test (a well-layered CTE pipeline splits cleanly into separate dbt models), sometimes cacheable or materializable by the engine, and understandable by the next engineer.

8. Non-SARGable Predicates

A predicate is SARGable (Search ARGument able) when the database can use an index to evaluate it. Wrapping a column in a function breaks SARGability.

-- ❌ Anti-pattern (PostgreSQL) — function on column, index on placed_at not used
SELECT * FROM orders WHERE EXTRACT(YEAR FROM placed_at) = 2026;
SELECT * FROM orders WHERE DATE_TRUNC('month', placed_at) = '2026-01-01';
SELECT * FROM orders WHERE UPPER(email) = 'ALICE@EXAMPLE.COM';

-- ✅ Fix: push the transform to the literal, not the column
SELECT * FROM orders
WHERE placed_at >= '2026-01-01' AND placed_at < '2027-01-01';

SELECT * FROM orders
WHERE placed_at >= '2026-01-01' AND placed_at < '2026-02-01';

SELECT * FROM orders WHERE email = LOWER('ALICE@EXAMPLE.COM');
-- or better: lowercase at write time, query lowercase column
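You can watch SARGability flip an index on and off in the query plan. This sketch uses Python's stdlib sqlite3 and its EXPLAIN QUERY PLAN output (sqlite has no YEAR(), so the non-SARGable form uses strftime; table and index names are invented). Exact plan wording differs per engine, but the SCAN-versus-index distinction is universal:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, placed_at TEXT)")
conn.execute("CREATE INDEX idx_orders_placed_at ON orders (placed_at)")

def plan(sql):
    # The human-readable plan detail is the last column of each row.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Function wrapped around the column: the index cannot be used.
non_sargable = plan(
    "SELECT * FROM orders WHERE strftime('%Y', placed_at) = '2026'")

# Range predicate on the bare column: the index is usable.
sargable = plan(
    "SELECT * FROM orders "
    "WHERE placed_at >= '2026-01-01' AND placed_at < '2027-01-01'")

print(non_sargable)  # e.g. SCAN orders
print(sargable)      # e.g. SEARCH orders USING INDEX idx_orders_placed_at ...
```

The same habit transfers: run EXPLAIN (or EXPLAIN ANALYZE) on your target engine whenever a predicate wraps an indexed column in a function.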

9. Forgetting NULL Semantics in Aggregations

COUNT(*) counts all rows including NULLs. COUNT(column) counts non-NULL values. AVG(column) ignores NULLs (which is usually what you want, but sometimes isn't).

-- ❌ Misleading: COUNT(*) vs COUNT(column) confusion (PostgreSQL)
SELECT
    COUNT(*) AS total_rows,           -- 1000 (includes NULLs)
    COUNT(email) AS with_email,       -- 800 (NULLs excluded)
    AVG(order_total) AS avg_total     -- average EXCLUDES null order_total rows
FROM orders;

-- ✅ Be explicit about NULL handling
SELECT
    COUNT(*) AS total_rows,
    COUNT(email) AS rows_with_email,
    COUNT(*) FILTER (WHERE email IS NULL) AS rows_without_email,  -- PostgreSQL syntax
    AVG(COALESCE(order_total, 0)) AS avg_total_incl_nulls_as_zero
FROM orders;
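The gap between the two averages is easy to quantify. A sketch with Python's stdlib sqlite3 and three invented rows (sqlite predates the FILTER clause in older versions, so this sticks to COUNT and COALESCE, which behave identically across dialects):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (email TEXT, order_total REAL);
    INSERT INTO orders VALUES
        ('a@example.com', 100.0),
        ('b@example.com', NULL),
        (NULL, 50.0);
""")

row = conn.execute("""
    SELECT COUNT(*), COUNT(email), AVG(order_total), AVG(COALESCE(order_total, 0))
    FROM orders
""").fetchone()

total_rows, with_email, avg_total, avg_incl_nulls = row
print(total_rows)      # 3
print(with_email)      # 2 — the NULL email is not counted
print(avg_total)       # 75.0 — the NULL order_total row is excluded
print(avg_incl_nulls)  # 50.0 — NULL treated as zero changes the answer
```

75.0 versus 50.0 from the same column is exactly the kind of discrepancy that surfaces in a dashboard review months later.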

10. Joining on Non-Unique Keys Without Checking

If you join two tables on a key that isn't unique in both tables, you silently create a Cartesian product for matching rows. Revenue numbers explode. Duplicates appear everywhere.

-- ❌ Anti-pattern — promotions.customer_id is NOT unique (one row per promo)
-- This multiplies order rows silently
SELECT o.order_id, o.total, p.promo_code
FROM orders o
JOIN promotions p ON o.customer_id = p.customer_id;
-- If a customer has 3 promotions, every order appears 3 times

-- ✅ Fix: validate uniqueness before joining
-- First, check:
SELECT customer_id, COUNT(*) FROM promotions GROUP BY 1 HAVING COUNT(*) > 1;

-- Then use the right join type (e.g., latest promo only):
WITH latest_promo AS (
    SELECT customer_id, promo_code,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
    FROM promotions
)
SELECT o.order_id, o.total, lp.promo_code
FROM orders o
LEFT JOIN latest_promo lp ON o.customer_id = lp.customer_id AND lp.rn = 1;

See Data Quality Testing for how to build automated checks that catch duplicate join keys before they reach production.
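The fan-out is reproducible in a few lines. This sketch uses Python's stdlib sqlite3 (ROW_NUMBER needs SQLite 3.25+, bundled with recent Pythons) with one invented order and three invented promotions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total REAL);
    INSERT INTO orders VALUES (10, 1, 99.0);
    CREATE TABLE promotions (customer_id INTEGER, promo_code TEXT, created_at TEXT);
    INSERT INTO promotions VALUES
        (1, 'SPRING', '2026-03-01'),
        (1, 'SUMMER', '2026-06-01'),
        (1, 'WINTER', '2026-12-01');
""")

# Naive join: the single order fans out to three rows,
# so any downstream SUM(total) is silently tripled.
fanned = conn.execute("""
    SELECT o.order_id, p.promo_code
    FROM orders o JOIN promotions p ON o.customer_id = p.customer_id
""").fetchall()

# Deduplicate first: keep only each customer's latest promotion.
deduped = conn.execute("""
    WITH latest_promo AS (
        SELECT customer_id, promo_code,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY created_at DESC) AS rn
        FROM promotions
    )
    SELECT o.order_id, lp.promo_code
    FROM orders o
    LEFT JOIN latest_promo lp ON o.customer_id = lp.customer_id AND lp.rn = 1
""").fetchall()

print(len(fanned))  # 3 — one order, three output rows
print(deduped)      # [(10, 'WINTER')]
```

A row-count assertion like this (output rows == orders rows) is cheap enough to bake into any pipeline that joins on a possibly-non-unique key.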

Quick Reference Cheat Sheet

Anti-Pattern                     → Quick Fix
SELECT * in pipelines            → Explicit column list
Correlated subquery              → Window function or pre-aggregated JOIN
Implicit type cast               → Match types or use explicit CAST
NOT IN with NULLs                → Use NOT EXISTS instead
HAVING for non-aggregate filter  → Move to WHERE clause
Nested subqueries                → Refactor to CTEs
Function on column in WHERE      → Push transform to literal
Ignoring NULL semantics          → Use COUNT(col), COALESCE, FILTER
Fan-out join                     → Validate key uniqueness before joining

A Note on Dialects

SQL is not one language: it's a family of dialects. DATE_TRUNC exists in both PostgreSQL and BigQuery, but with the arguments in opposite order; Spark SQL uses TRUNC(date, 'MONTH') for dates. FILTER (WHERE ...) is PostgreSQL; BigQuery and Spark use COUNTIF. Always specify your dialect when sharing SQL, and always test on the target engine.

If you want to explore and validate SQL across different data sources without spinning up infrastructure, Harbinger Explorer runs DuckDB SQL directly in the browser. It's useful for testing query logic on CSVs or API data before porting to your production warehouse.

Wrapping Up

Most SQL bugs are invisible — they don't throw errors, they just return wrong numbers. The anti-patterns above are especially dangerous because they work until they don't: the wrong row count only matters when someone actually checks the numbers.

Next step: Pick one query in your current pipeline and run it through this checklist. The NOT IN / NULL and fan-out join patterns are the highest-probability bugs to find in existing code.
