sql - GROUP or DISTINCT after JOIN returns duplicates


asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)


I have two tables, products and meta. They are in a 1:N relationship: each product row has at least one meta row via a foreign key.

(See SQL Fiddle: http://sqlfiddle.com/#!15/c8f34/1)

I need to join these two tables, but I only want unique products. When I try this query, everything is OK (4 rows returned):

SELECT DISTINCT(product_id)
FROM meta JOIN products ON products.id = meta.product_id

But when I try to select all columns, DISTINCT no longer seems to apply to the results: 8 rows are returned instead of 4.

SELECT DISTINCT(product_id), *
FROM meta JOIN products ON products.id = meta.product_id

I have tried many approaches, such as applying DISTINCT or GROUP BY in a subquery, but always with the same result.



1 Answer

深蓝 · Answer 1 · 2021-10-23T19:39:44+0000

When retrieving all or most rows from a table, the fastest approach for this type of query is typically to aggregate / disambiguate first and join later:

SELECT *
FROM   products p
JOIN  (
   SELECT DISTINCT ON (product_id) *
   FROM   meta
   ORDER  BY product_id, id DESC
   ) m ON m.product_id = p.id;

The more rows in meta per row in products, the bigger the impact on performance.

Of course, you'll want to add an ORDER BY clause in the subquery to define which row to pick from each set. @Craig and @Clodoaldo already told you about that. I am returning the meta row with the highest id.
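To illustrate how the ORDER BY decides which row survives, here is a minimal sketch that keeps the cheapest meta row per product instead of the one with the highest id. It assumes meta has a sortable price column (the name shows up in a later query in this answer; its type and nullability are my assumptions). The id is only added as a tie-breaker so the pick stays deterministic.

```sql
-- Sketch: keep the meta row with the lowest price per product.
-- Assumption: meta.price is a sortable (e.g. numeric) column.
SELECT *
FROM   products p
JOIN  (
   SELECT DISTINCT ON (product_id) *
   FROM   meta
   ORDER  BY product_id          -- must start with the DISTINCT ON expression(s)
           , price NULLS LAST    -- cheapest first; rows without a price sort last
           , id DESC             -- tie-breaker among equal prices: highest id wins
   ) m ON m.product_id = p.id;
```

The leading ORDER BY expressions have to match the DISTINCT ON list; everything after that only ranks rows within each product_id group.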

SQL Fiddle.

Details for DISTINCT ON:

Select first row in each GROUP BY group?

Optimize performance

Still, this is not always the fastest solution. Depending on data distribution there are various other query styles. For this simple case involving another join, this one ran considerably faster in a test with big tables:

SELECT p.*, sub.meta_id, m.product_id, m.price, m.flag
FROM  (
   SELECT product_id, max(id) AS meta_id
   FROM   meta
   GROUP  BY 1
   ) sub
JOIN meta     m ON m.id = sub.meta_id
JOIN products p ON p.id = sub.product_id;

If you did not use the non-descriptive id as a column name, we would not run into naming collisions and could simply write SELECT p.*, m.*. (I never use id as a column name.)
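As a rough sketch of what that could look like: the column names products.product_id and meta.meta_id below are hypothetical renames of the two id columns, not something taken from the fiddle.

```sql
-- Sketch with hypothetical descriptive key names:
--   products.product_id  (instead of products.id)
--   meta.meta_id         (instead of meta.id)
SELECT p.*, m.*          -- product_id is output twice (from p and m), with identical values
FROM  (
   SELECT product_id, max(meta_id) AS meta_id
   FROM   meta
   GROUP  BY product_id
   ) sub
JOIN   meta     m ON m.meta_id    = sub.meta_id
JOIN   products p ON p.product_id = sub.product_id;
```

With distinct key names there is no longer a need to list and alias the meta columns one by one.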

If performance is your paramount requirement, consider more options:

A MATERIALIZED VIEW with pre-aggregated data from meta, if your data does not change (much).
A recursive CTE emulating a loose index scan, for a big meta table with many rows per product (relatively few distinct product_id). This is the only way I know to use an index for a DISTINCT query over the whole table.
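Both options can be sketched roughly as follows. The view name meta_latest and the index name are made up for illustration. The sketch also assumes PostgreSQL 9.3 or later (materialized views and LATERAL), that meta.id is unique, and that meta.product_id is never NULL; none of this is confirmed by the question.

```sql
-- Option 1: pre-aggregate once, refresh when meta has changed.
CREATE MATERIALIZED VIEW meta_latest AS      -- hypothetical view name
SELECT DISTINCT ON (product_id) *
FROM   meta
ORDER  BY product_id, id DESC;

REFRESH MATERIALIZED VIEW meta_latest;       -- rerun after changes to meta

-- Option 2: recursive CTE emulating a loose index scan.
-- Assumed supporting index (hypothetical name):
CREATE INDEX meta_product_id_id_idx ON meta (product_id, id DESC);

WITH RECURSIVE cte AS (
   (  -- parentheses are required around a UNION term that uses ORDER BY / LIMIT
   SELECT product_id, id
   FROM   meta
   ORDER  BY product_id, id DESC
   LIMIT  1
   )
   UNION ALL
   SELECT m.product_id, m.id
   FROM   cte c
   CROSS  JOIN LATERAL (
      -- jump to the next higher product_id and take its highest id
      SELECT product_id, id
      FROM   meta
      WHERE  product_id > c.product_id
      ORDER  BY product_id, id DESC
      LIMIT  1
      ) m
   )
SELECT p.*, m.id AS meta_id, m.price, m.flag
FROM   cte c
JOIN   meta     m ON m.id = c.id             -- relies on meta.id being unique
JOIN   products p ON p.id = c.product_id;
```

The recursion stops on its own once the LATERAL subquery finds no larger product_id. Each step is a single index lookup, so this should only pay off when there are few distinct product_id values relative to the number of meta rows.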
