sql - Select first row in each GROUP BY group?

asked Feb 21, 2021 in Technique by 深蓝 (71.8m points)

As the title suggests, I'd like to select the first row of each set of rows grouped with a GROUP BY.

Specifically, if I've got a purchases table that looks like this:

SELECT * FROM purchases;

My Output:

id | customer | total
---+----------+------
 1 | Joe      | 5
 2 | Sally    | 3
 3 | Joe      | 2
 4 | Sally    | 1

I'd like to query for the id of the largest purchase (total) made by each customer.

Something like this:

SELECT FIRST(id), customer, FIRST(total)
FROM  purchases
GROUP BY customer
ORDER BY total DESC;

Expected Output:

FIRST(id) | customer | FIRST(total)
----------+----------+-------------
        1 | Joe      | 5
        2 | Sally    | 3

Asked by David Wolever; translated from Stack Overflow.

1 Answer

深蓝 · Answer 1 · 2021-02-20T21:45:10+0000

In PostgreSQL this is typically simpler and faster (more performance optimization below):

SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;

Or shorter (if not as clear) with ordinal numbers of output columns:

SELECT DISTINCT ON (2)
       id, customer, total
FROM   purchases
ORDER  BY 2, 3 DESC, 1;

If total can be NULL (won't hurt either way, but you'll want to match existing indexes):

...
ORDER  BY customer, total DESC NULLS LAST, id;

Major points

DISTINCT ON is a PostgreSQL extension of the standard (where only DISTINCT on the whole SELECT list is defined).

List any number of expressions in the DISTINCT ON clause; the combined row value defines duplicates. The manual:

"Obviously, two rows are considered distinct if they differ in at least one column value. Null values are considered equal in this comparison."

Bold emphasis mine.

DISTINCT ON can be combined with ORDER BY. Leading ORDER BY expressions have to match the leading DISTINCT ON expressions in the same order. You can add additional expressions to ORDER BY to pick a particular row from each group of peers. I added id as last item to break ties:
"Pick the row with the smallest id from each group sharing the highest total."

To order results in a way that disagrees with the sort order determining the first row per group, you can nest the above query in an outer query with another ORDER BY. Like:
- PostgreSQL DISTINCT ON with different ORDER BY

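As a minimal sketch of that nesting (the subquery alias sub is arbitrary, and the outer sort order is only an illustration), the inner DISTINCT ON query picks one row per customer and the outer query re-sorts the result:

```sql
-- Inner query: one row per customer, the largest total first, ties broken by id.
-- Outer query: present those rows ordered by total instead of by customer.
SELECT *
FROM  (
   SELECT DISTINCT ON (customer)
          id, customer, total
   FROM   purchases
   ORDER  BY customer, total DESC, id
   ) sub
ORDER  BY total DESC, id;
```

With the sample data from the question, this returns Joe's purchase (id 1, total 5) before Sally's (id 2, total 3).
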
If total can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST like demonstrated. Details:
- PostgreSQL sort by datetime asc, null first?

The SELECT list is not constrained by expressions in DISTINCT ON or ORDER BY in any way. (Not needed in the simple case above):
- You don't have to include any of the expressions in DISTINCT ON or ORDER BY.
- You can include any other expression in the SELECT list. This is instrumental for replacing much more complex queries with subqueries and aggregate / window functions.

I tested with Postgres versions 8.3 – 12. But the feature has been there at least since version 7.1, so basically always.
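
A minimal sketch of that freedom in the SELECT list, assuming only the id is wanted: customer and total still drive DISTINCT ON and ORDER BY, but neither appears in the output.

```sql
-- Return just the id of each customer's largest purchase.
-- customer and total are used for grouping and sorting only.
SELECT DISTINCT ON (customer)
       id
FROM   purchases
ORDER  BY customer, total DESC, id;
```

For the sample data in the question this yields the ids 1 and 2.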

Index

The perfect index for the above query would be a multi-column index spanning all three columns in matching sequence and with matching sort order:

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

May be too specialized. But use it if read performance for the particular query is crucial.

If you have DESC NULLS LAST in the query, use the same in the index so that sort order matches and the index is applicable.
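
For instance, a matching index for the NULLS LAST variant of the query might look like the following sketch. The index name is made up for illustration; the column list mirrors the ORDER BY of the query.

```sql
-- Same three columns as before, but total sorted DESC NULLS LAST
-- to match "ORDER BY customer, total DESC NULLS LAST, id".
CREATE INDEX purchases_3c_nulls_last_idx
ON purchases (customer, total DESC NULLS LAST, id);
```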

Effectiveness / Performance optimization

Weigh cost and benefit before creating tailored indexes for each query. The potential of the above index largely depends on data distribution.

The index is used because it delivers pre-sorted data. In Postgres 9.2 or later the query can also benefit from an index-only scan if the index is smaller than the underlying table. The index has to be scanned in its entirety, though.

For few rows per customer (high cardinality in column customer), this is very efficient. Even more so if you need sorted output anyway. The benefit shrinks with a growing number of rows per customer.

Ideally, you have enough work_mem to process the involved sort step in RAM and not spill to disk. But generally setting work_mem too high can have adverse effects. Consider SET LOCAL for exceptionally big queries. Find how much you need with EXPLAIN ANALYZE. Mention of "Disk:" in the sort step indicates the need for more:
- Configuration parameter work_mem in PostgreSQL on Linux
- Optimize simple query using ORDER BY date and text

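A rough sketch of that workflow, assuming the query runs inside its own transaction; the value '256MB' is a placeholder for illustration, not a recommendation:

```sql
BEGIN;

-- Raise work_mem for this transaction only; the setting reverts at COMMIT or ROLLBACK.
SET LOCAL work_mem = '256MB';

-- Run the query and inspect the sort step in the reported plan.
EXPLAIN ANALYZE
SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;

COMMIT;
```

If the plan contains a sort step, a sort that spilled reports a "Disk:" figure on its "Sort Method" line, while an in-memory sort reports "Memory:" instead. If the planner reads pre-sorted rows from a matching index, there may be no sort step at all.
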
For many rows per customer (low cardinality in column customer ), a loose index scan (aka "skip scan") would be (much) more efficient, but that's not implemented up to Postgres 12. (An implementation for index-only scans is in development for Postgres 13. See <a href="https://stackoom.
