I faced a problem with a data loss, caused by a wrong query.
Data restored, but now I'd like to understand the problem.
I encountered the problem on SQL Server 2014, but I replicated it on SQL Server 2000 and PostgreSQL. Specifically, there was a DELETE. In the following scenario I use a SELECT.
The tables creation for sql server 2014:
CREATE TABLE [dbo].[tmp_color](
[color_id] [int] NOT NULL,
[color_name] [nvarchar](50) NOT NULL,
[color_cat] [int] NOT NULL,
CONSTRAINT [PK_tmp_color] PRIMARY KEY CLUSTERED (
[color_id] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF
, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
CREATE TABLE [dbo].[tmp_color_cat](
[catid] [int] NOT NULL,
[catname] [nvarchar](50) NOT NULL,
CONSTRAINT [PK_tmp_color_cat] PRIMARY KEY CLUSTERED (
[catid] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF
, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
And the Postgres version:
CREATE TABLE tmp_color (
color_id integer NOT NULL,
color_name text,
color_cat integer,
CONSTRAINT tmp_color_pkey PRIMARY KEY (color_id)
);
CREATE TABLE tmp_color_cat (
catid integer NOT NULL,
catname text,
CONSTRAINT tmp_color_cat_pkey PRIMARY KEY (catid)
);
Data population (works on both RDBMS):
INSERT INTO tmp_color_cat (catid, catname) VALUES (1, 'magic color');
INSERT INTO tmp_color_cat (catid, catname) VALUES (2, 'normal color');
INSERT INTO tmp_color (color_id, color_name, color_cat) VALUES (1, 'red', 1);
INSERT INTO tmp_color (color_id, color_name, color_cat) VALUES (2, 'green', 2);
INSERT INTO tmp_color (color_id, color_name, color_cat) VALUES (3, 'black', 1);
The following SELECT is wrong:
SELECT color_cat
FROM tmp_color_cat;
Because color_cat
does not exists in tmp_color_cat
.
BUT, the moment you take this in a subquery:
SELECT * FROM tmp_color
WHERE color_cat IN(
SELECT color_cat
FROM tmp_color_cat
WHERE catname = 'magic color'
);
It returns every single record from tmp_color
.
The logical error in script is obvious: developer wrote the wrong column to identify category. If you are deleting records instead of selecting them, you will delete entire table. Not good.
Is this desired behavior? Or it is a consequence of subquery design?
By observing the execution plan of SQL Server, the logical operation is a Left Semi Join.
I found a couple of posts, one for PostgreSQL and one for SQL Server. Is there any good documentation I could send to the developer group explaining why this is not an error?
How can I avoid this kind of problems? My first idea is to use an alias. Aliases are good.
See Question&Answers more detail:
os