Finding sequential records

Lists: pgsql-sql
From: Steve Midgley <science(at)misuse(dot)org>
To: pgsql-sql(at)postgresql(dot)org
Subject: Finding sequential records
Date: 2008-09-26 17:39:13
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-sql

Hi,

I've been kicking this around today and I can't think of a way to solve
my problem in "pure SQL" (i.e. I can only do it with a
looping/cursor-type solution and some variables).

Given a table with this DDL/data script:

drop table if exists dummy;
create table dummy (
id integer primary key,
name varchar(255),
fkey_id integer
)
;
insert into dummy (id, name, fkey_id) values (502163,'3Br/3Ba Pool
Villa in Westin St. John, USVI- Summer 2008',500100);
insert into dummy (id, name, fkey_id) values (502164,'3Br/3Ba Pool
Villa in Westin St. John, USVI- Summer 2008',500100);
insert into dummy (id, name, fkey_id) values (502169,'Lazy Bear
Lodge',105);
-- not sequential id to previous
insert into dummy (id, name, fkey_id) values (502199,'Lazy Bear
Lodge',105);
insert into dummy (id, name, fkey_id) values (502170,'3 Bed, 1 Bath
Cottage Less Than a Mile from West Dennis Beach',500089);
insert into dummy (id, name, fkey_id) values (502171,'3 Bed, 1 Bath
Cottage Less Than a Mile from West Dennis Beach',500089);
-- not sequential id nor duplicate fkey_id to previous
insert into dummy (id, name, fkey_id) values (502175,'3 Bed, 1 Bath
Cottage Less Than a Mile from West Dennis Beach',500102);
insert into dummy (id, name, fkey_id) values (502213,'Sea
Watch',500128);
-- not duplicate fkey_id to previous
insert into dummy (id, name, fkey_id) values (502214,'Sea
Watch',500130);

Find all instances where
* name is duplicated
* fkey_id is the same (for the any set of duplicated name fields)
* id is sequential (for any set of duplicated name fields)

The system should return

502163
502164
502170
502171

Here's as far as I got:

select id
from dummy
where
name in (
select name from dummy
group by name
having count(name)>1
)
order by id

I can't figure out how to test for duplicate fkey_id when name is the
same, nor to test for sequential id's when name is the same.

Having a method for either would be great, and both would be a bonus!

It seems like there's a clever way to do this without cursors but I
can't figure it out!

Thanks for any help!

Steve


From: "Richard Broersma" <richard(dot)broersma(at)gmail(dot)com>
To: "Steve Midgley" <science(at)misuse(dot)org>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Finding sequential records
Date: 2008-09-26 18:02:25
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-sql

On Fri, Sep 26, 2008 at 10:39 AM, Steve Midgley <science(at)misuse(dot)org> wrote:
> drop table if exists dummy;
> create table dummy (
> id integer primary key,
> name varchar(255),
> fkey_id integer
> )
> ;

> The system should return
>
> 502163
> 502164
> 502170
> 502171

--first get all of the duplicated ids

SELECT id
FROM Dummy
GROUP BY name, fkey_id

--Next from this list find check to see if there are any sibling
immediate above or below it.

SELECT A.*
FROM ( SELECT ID
FROM Dummy
GROUP BY name, fkey_id ) AS A
INNER JOIN Dummy AS D
ON A.id - 1 = D.id
OR A.id + 1 = D.id;

--
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug


From: "Oliveiros Cristina" <oliveiros(dot)cristina(at)marktest(dot)pt>
To: <pgsql-sql(at)postgresql(dot)org>, "Steve Midgley" <science(at)misuse(dot)org>
Subject: Re: Finding sequential records
Date: 2008-09-26 18:09:24
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-sql

Can this be what you need?

Best,
Oliveiros

SELECT id
FROM dummy a
NATURAL JOIN
(
SELECT fkey_id,name
FROM dummy
GROUP BY fkey_id,name
HAVING COUNT(*) > 1
AND SUM(id) = (MAX(id) + MIN(id)) * (MAX(id) - MIN(id) + 1) / 2
) b
ORDER BY id;

----- Original Message -----
From: "Steve Midgley" <science(at)misuse(dot)org>
To: <pgsql-sql(at)postgresql(dot)org>
Sent: Friday, September 26, 2008 6:39 PM
Subject: [SQL] Finding sequential records

> Hi,
>
> I've been kicking this around today and I can't think of a way to solve
> my problem in "pure SQL" (i.e. I can only do it with a
> looping/cursor-type solution and some variables).
>
> Given a table with this DDL/data script:
>
> drop table if exists dummy;
> create table dummy (
> id integer primary key,
> name varchar(255),
> fkey_id integer
> )
> ;
> insert into dummy (id, name, fkey_id) values (502163,'3Br/3Ba Pool
> Villa in Westin St. John, USVI- Summer 2008',500100);
> insert into dummy (id, name, fkey_id) values (502164,'3Br/3Ba Pool
> Villa in Westin St. John, USVI- Summer 2008',500100);
> insert into dummy (id, name, fkey_id) values (502169,'Lazy Bear
> Lodge',105);
> -- not sequential id to previous
> insert into dummy (id, name, fkey_id) values (502199,'Lazy Bear
> Lodge',105);
> insert into dummy (id, name, fkey_id) values (502170,'3 Bed, 1 Bath
> Cottage Less Than a Mile from West Dennis Beach',500089);
> insert into dummy (id, name, fkey_id) values (502171,'3 Bed, 1 Bath
> Cottage Less Than a Mile from West Dennis Beach',500089);
> -- not sequential id nor duplicate fkey_id to previous
> insert into dummy (id, name, fkey_id) values (502175,'3 Bed, 1 Bath
> Cottage Less Than a Mile from West Dennis Beach',500102);
> insert into dummy (id, name, fkey_id) values (502213,'Sea
> Watch',500128);
> -- not duplicate fkey_id to previous
> insert into dummy (id, name, fkey_id) values (502214,'Sea
> Watch',500130);
>
> Find all instances where
> * name is duplicated
> * fkey_id is the same (for the any set of duplicated name fields)
> * id is sequential (for any set of duplicated name fields)
>
> The system should return
>
> 502163
> 502164
> 502170
> 502171
>
> Here's as far as I got:
>
> select id
> from dummy
> where
> name in (
> select name from dummy
> group by name
> having count(name)>1
> )
> order by id
>
> I can't figure out how to test for duplicate fkey_id when name is the
> same, nor to test for sequential id's when name is the same.
>
> Having a method for either would be great, and both would be a bonus!
>
> It seems like there's a clever way to do this without cursors but I
> can't figure it out!
>
> Thanks for any help!
>
> Steve
>
>
> --
> Sent via pgsql-sql mailing list (pgsql-sql(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql


From: "Richard Broersma" <richard(dot)broersma(at)gmail(dot)com>
To: "Steve Midgley" <science(at)misuse(dot)org>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Finding sequential records
Date: 2008-09-26 19:12:35
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-sql

oops I noticed I forgot the having clause:

> SELECT id
> FROM Dummy
> GROUP BY name, fkey_id
Having count(*) > 1;

> SELECT A.*
> FROM ( SELECT ID
> FROM Dummy
> GROUP BY name, fkey_id
HAVING count(*) > 1 ) AS A
> INNER JOIN Dummy AS D
> ON A.id - 1 = D.id
> OR A.id + 1 = D.id;

--
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug


From: Steve Midgley <science(at)misuse(dot)org>
To: "Richard Broersma" <richard(dot)broersma(at)gmail(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org,oliveiros(dot)cristina(at)marktest(dot)pt
Subject: Re: Finding sequential records
Date: 2008-09-26 22:25:59
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-sql

Wow.

Thanks to both Richard and Oliveiros.

Out of the box Oliveiros' solution does what I want but I don't
understand why!

>SELECT id
>FROM dummy a
>NATURAL JOIN (
>SELECT fkey_id,name
>FROM dummy
>GROUP BY fkey_id,name
>HAVING COUNT(*) > 1 AND SUM(id) = (MAX(id) + MIN(id)) * (MAX(id) -
>MIN(id) + 1) / 2
>) b
>ORDER BY id;

What's going on here with the sum(id) equaling the average product of
the min and max? I gather that's to match id's with id's that are one
bigger than itself? Can anyone clarify how that is working?

Richard's sql is very interesting to me in concept - but it's not
getting me the results correctly:

>SELECT A.*
> FROM ( SELECT ID
> FROM Dummy
> GROUP BY name, fkey_id ) AS A
>INNER JOIN Dummy AS D
> ON A.id - 1 = D.id
> OR A.id + 1 = D.id;

This returns an error:

ERROR: column "dummy.id" must appear in the GROUP BY clause or be used
in an aggregate function
SQL state: 42803

I'm not sure how to setup that "from select" to produce id's without
adding id to the group by (which would cause the query to return too
many rows). Perhaps a natural join like in Oliveiros' sql would do the
job?

Thanks for any advice on either of these solutions. I'm going to learn
a lot here if someone can pound it into my head.

Thanks,

Steve

It seems to be returning any records that have sequential id's
regardless
At 11:02 AM 9/26/2008, Richard Broersma wrote:
>On Fri, Sep 26, 2008 at 10:39 AM, Steve Midgley <science(at)misuse(dot)org>
>wrote:
> > drop table if exists dummy;
> > create table dummy (
> > id integer primary key,
> > name varchar(255),
> > fkey_id integer
> > )
> > ;
>
> > The system should return
> >
> > 502163
> > 502164
> > 502170
> > 502171
>
>
>--first get all of the duplicated ids
>
> SELECT id
> FROM Dummy
>GROUP BY name, fkey_id
>
>
>--Next from this list find check to see if there are any sibling
>immediate above or below it.
>
>SELECT A.*
> FROM ( SELECT ID
> FROM Dummy
> GROUP BY name, fkey_id ) AS A
>INNER JOIN Dummy AS D
> ON A.id - 1 = D.id
> OR A.id + 1 = D.id;
>
>--
>Regards,
>Richard Broersma Jr.
>
>Visit the Los Angeles PostgreSQL Users Group (LAPUG)
>http://pugs.postgresql.org/lapug


From: "Oliveiros Cristina" <oliveiros(dot)cristina(at)gmail(dot)com>
To: "Steve Midgley" <science(at)misuse(dot)org>
Cc: "Richard Broersma" <richard(dot)broersma(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org, oliveiros(dot)cristina(at)marktest(dot)pt
Subject: Re: Finding sequential records
Date: 2008-09-27 00:38:08
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-sql

Howdy, Steve.

SELECT id
FROM dummy a
NATURAL JOIN (
SELECT fkey_id,name
FROM dummy
GROUP BY fkey_id,name
HAVING COUNT(*) > 1 AND SUM(id) = (MAX(id) + MIN(id)) * (MAX(id) - MIN(id) +
1) / 2
) b
ORDER BY id;

The GROUP BY clause is to associate records that have the same fkey_id and
name
The COUNT(*) > 1 eliminates the situations when there is just one.
Now, about the equality, now i am thinking and maybe it is a bazooka to kill
a fly. :)
In your table you just have duplicates? Or you may have triplicates? And
quadruplicates? And in general n-uplicates? At the time, I thought you might
have n-uplicates, so I designed the query to be as general as possible to
handle all that cases, from which duplicates are a particular case, but now
i am wondering if you don't have more than duplicates.

Well, anyway the idea is as follows
The sum of a sequence is given by first + last / 2 * n, with n = last -
first + 1, OK ?

So, if the set of ids is sequencial, its sum must equal that expression.
It's basically that.

But I am now wondering now that I might have misunderstood what your
requests were...

If you just have duplicates, then maybe it is cleaner to substitute that
clause by something simpler, like MAX(id) - MIN(id) = 1

I dunno if I fully answered your questions, but if I didn't feel free to ask

Best, Oliveiros

>
>

--
We are going to have peace even if we have to fight for it. - General Dwight
D. Eisenhower

Teremos paz, nem que tenhamos de lutar por ela
- General Dwight D. Eisenhower


From: "Richard Broersma" <richard(dot)broersma(at)gmail(dot)com>
To: "Steve Midgley" <science(at)misuse(dot)org>
Cc: pgsql-sql(at)postgresql(dot)org, oliveiros(dot)cristina(at)marktest(dot)pt
Subject: Re: Finding sequential records
Date: 2008-09-27 01:08:35
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-sql

On Fri, Sep 26, 2008 at 3:25 PM, Steve Midgley <science(at)misuse(dot)org> wrote:

> This returns an error:
>
> ERROR: column "dummy.id" must appear in the GROUP BY clause or be used in an
> aggregate function
> SQL state: 42803

Oops that what I get for trying air code :(

This works instead:

SELECT D1.*
FROM Dummy AS D1
INNER JOIN Dummy AS D2
ON (D1.name,D1.fkey_id)=(D2.name,D2.fkey_id)
AND (D1.id = D2.id + 1 OR D1.id = D2.id - 1 )
ORDER BY D1.id;

--
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug


From: Steve Midgley <science(at)misuse(dot)org>
To: "Oliveiros Cristina" <oliveiros(dot)cristina(at)gmail(dot)com>
Cc: "Richard Broersma" <richard(dot)broersma(at)gmail(dot)com>,pgsql-sql(at)postgresql(dot)org
Subject: Re: Finding sequential records
Date: 2008-09-30 02:48:09
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-sql

At 05:38 PM 9/26/2008, Oliveiros Cristina wrote:
>In-Reply-To: <20080926222618(dot)4DD3664FC01(at)postgresql(dot)org>
>References: <20080926173921(dot)EFDA164FC00(at)postgresql(dot)org>
> <396486430809261102j73869b8es6b325621bcfe1ea6(at)mail(dot)gmail(dot)com>
> <20080926222618(dot)4DD3664FC01(at)postgresql(dot)org>
>Howdy, Steve.
>
>SELECT id
>FROM dummy a
>NATURAL JOIN (
>SELECT fkey_id,name
>FROM dummy
>GROUP BY fkey_id,name
>HAVING COUNT(*) > 1 AND SUM(id) = (MAX(id) + MIN(id)) * (MAX(id) -
>MIN(id) + 1) / 2
>) b
>ORDER BY id;
>
>In your table you just have duplicates? Or you may have triplicates?
>And quadruplicates? And in general n-uplicates? At the time, I thought
>you might have n-uplicates, so I designed the query to be as general
>as possible to handle all that cases, from which duplicates are a
>particular case, but now i am wondering if you don't have more than
>duplicates.

In my specific case it turns out I only had duplicates, but there could
have been n-plicates, so your code is still correct for my use-case
(though I didn't say that in my OP).

>Well, anyway the idea is as follows
>The sum of a sequence is given by first + last / 2 * n, with n = last
>- first + 1, OK ?

I *love* your application of that formula. It's rare for me to be able
to use "real" math in SQL, so this was a pleasure to read (and
understand!)

Thanks again to Richard and Oliveiros for a truly educating experience!
I hope some others were similarly enlightened.

With gratitude,

Steve


From: "Richard Broersma" <richard(dot)broersma(at)gmail(dot)com>
To: "Steve Midgley" <science(at)misuse(dot)org>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Finding sequential records
Date: 2008-09-30 04:50:14
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-sql

On Mon, Sep 29, 2008 at 7:48 PM, Steve Midgley <science(at)misuse(dot)org> wrote:

> In my specific case it turns out I only had duplicates, but there could have
> been n-plicates, so your code is still correct for my use-case (though I
> didn't say that in my OP).

Ya there are a lot of neat queries that you can construct. If you
have a good background in math and set theory (which I don't have) you
can develop all sorts of powerful analysis queries.

On a side note, I thought that I should mention that unwanted
duplicates are an example where some ~have gotten bitten~ with a
purely surrogate key approach. To make matter worse, is when some
users update part of one duplicate and another updates a different
duplicated on a another field(s). Then once the designer discovers
the duplicate problem, she/he has to figure out some way of merging
these non-exact duplicates. So even if the designer has no intention
of implementing natural primary/foreign keys, he/she will still
benefit from a natural key consideration in that a strategy can be
designed to prevent getting bitten by duplicated data.

I only mention this because db designers get bitten by this all the
time. Well at least the ones that subscribe to www.utteraccess.com
get bitten. From what I've seen not one day has gone by without
someone posting a question to this site about how to both find and
remove all but one of the duplicates.

--
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug


From: Steve Midgley <science(at)misuse(dot)org>
To: "Richard Broersma" <richard(dot)broersma(at)gmail(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Finding sequential records
Date: 2008-09-30 06:05:32
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-sql

At 09:50 PM 9/29/2008, Richard Broersma wrote:
>On Mon, Sep 29, 2008 at 7:48 PM, Steve Midgley <science(at)misuse(dot)org>
>wrote:
>
> > In my specific case it turns out I only had duplicates, but there
> could have
> > been n-plicates, so your code is still correct for my use-case
> (though I
> > didn't say that in my OP).
>
>Ya there are a lot of neat queries that you can construct. If you
>have a good background in math and set theory (which I don't have) you
>can develop all sorts of powerful analysis queries.
>
>On a side note, I thought that I should mention that unwanted
>duplicates are an example where some ~have gotten bitten~ with a
>purely surrogate key approach. To make matter worse, is when some
>users update part of one duplicate and another updates a different
>duplicated on a another field(s). Then once the designer discovers
>the duplicate problem, she/he has to figure out some way of merging
>these non-exact duplicates. So even if the designer has no intention
>of implementing natural primary/foreign keys, he/she will still
>benefit from a natural key consideration in that a strategy can be
>designed to prevent getting bitten by duplicated data.
>
>I only mention this because db designers get bitten by this all the
>time. Well at least the ones that subscribe to www.utteraccess.com
>get bitten. From what I've seen not one day has gone by without
>someone posting a question to this site about how to both find and
>remove all but one of the duplicates.

Truly. I have worked with some school districts around the US and this
duplicate record problem is more than theoretical. Some of the
gnarliest, dirtiest, n-plicate data I've ever seen comes out of the US
public education system.

More generally where I have seen a need for natural keys, I've always
taken the "best of both worlds" approach. So I always stick an
integer/serial PK into any table - why not - they're cheap and
sometimes are handy. And then for tables along the lines of your
description, I add a compound unique index which serves the business
rule of "no dupes along these lines."

Am I following your point? Any reason why using serial PK's with
"compound natural unique indices" is better/worse than just using
natural PK's?

Steve


From: "Richard Broersma" <richard(dot)broersma(at)gmail(dot)com>
To: "Steve Midgley" <science(at)misuse(dot)org>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Finding sequential records
Date: 2008-09-30 14:02:05
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-sql

On Mon, Sep 29, 2008 at 11:05 PM, Steve Midgley <science(at)misuse(dot)org> wrote:

> Any reason why using serial PK's with "compound
> natural unique indices" is better/worse than just using natural PK's?

Not really, surrogate keys will always work well so long as unwanted
duplicates are constrained. Surrogate Keys will allow as much
flexibility and versatility as is possible. However as the case of
unwanted duplicates illustrates, sometimes flexibility and versatility
is always wanted. So if flexibility and versatility is valued less
than the ability to add inflexible constraints (beyond preventing
duplicates) across multiple relationships, then using natural primary
keys becomes very attractive.

--
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug