
X-Newsgroups: pgsql.general
Date: Sun, 23 Dec 2012 18:45:11 -0800 (PST)
Subject: Re: [GENERAL] delete duplicates takes too long
From: dhakaboy13@gmail.com
To: pgsql-general@postgresql.org

CREATE TABLE public.cdr_ama_stat (
    id            int4 NOT NULL DEFAULT nextval('cdr_ama_stat_id_seq'::regclass),
    abonado_a     varchar(30) NULL,
    abonado_b     varchar(30) NULL,
    fecha_llamada timestamp NULL,
    duracion      int4 NULL,
    puerto_a      varchar(4) NULL,
    puerto_b      varchar(4) NULL,
    tipo_llamada  char(1) NULL,
    processed     int4 NULL,
    PRIMARY KEY (id)
);

CREATE INDEX kpi_fecha_llamada
    ON public.cdr_ama_stat (fecha_llamada);
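Since the rows are expected to be unique on (abonado_a, abonado_b, fecha_llamada, duracion), one option once the existing duplicates are cleaned out is a unique index on that natural key, so new duplicates are rejected at bulk-load time rather than deleted afterwards. A sketch (the index name is made up; note that because these columns are NULLable, Postgres treats NULLs as distinct, so rows differing only in NULL columns would still be accepted):

```
-- hypothetical index name; rejects exact duplicates on the natural key,
-- but NULLs compare as distinct, so NULL-bearing "duplicates" still pass
CREATE UNIQUE INDEX cdr_ama_stat_natural_key
    ON public.cdr_ama_stat (abonado_a, abonado_b, fecha_llamada, duracion);
```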

There should be unique values for (abonado_a, abonado_b, fecha_llamada, duracion) in every row. Googling around, I found how to delete duplicates on the postgresonline site, so I ran the following query (let's say I want to know how many duplicates exist for 2009-04-18, before deleting them):

SELECT * FROM cdr_ama_stat
WHERE id NOT IN
    (SELECT MAX(dt.id)
     FROM cdr_ama_stat AS dt
     WHERE dt.fecha_llamada BETWEEN '2009-04-18' AND '2009-04-18'::timestamp + INTERVAL '1 day'
     GROUP BY dt.abonado_a, dt.abonado_b, dt.fecha_llamada, dt.duracion)
AND fecha_llamada BETWEEN '2009-04-18' AND '2009-04-18'::timestamp + INTERVAL '1 day'
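A rewrite that usually sidesteps the per-row `NOT IN` probe turns the query into an anti-join against the set of ids to keep: the planner can then use a hash or merge join instead of rescanning the subquery result once for every outer row. A sketch with the same grouping semantics as the query above (not tested against this table):

```
-- anti-join form: keep_id is the MAX(id) per duplicate group;
-- rows whose id is not a keep_id are the duplicates
SELECT a.*
FROM cdr_ama_stat a
LEFT JOIN (
    SELECT MAX(id) AS keep_id
    FROM cdr_ama_stat
    WHERE fecha_llamada BETWEEN '2009-04-18' AND '2009-04-18'::timestamp + INTERVAL '1 day'
    GROUP BY abonado_a, abonado_b, fecha_llamada, duracion
) k ON k.keep_id = a.id
WHERE a.fecha_llamada BETWEEN '2009-04-18' AND '2009-04-18'::timestamp + INTERVAL '1 day'
  AND k.keep_id IS NULL;
```

The join is on `id` (the NOT NULL primary key), so the NULL pitfalls of `NOT IN` do not apply here; the same shape works for the eventual `DELETE ... USING`.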

My problem is that the query takes forever. Number of rows:

kpi=# select count(*) from cdr_ama_stat;
  count
---------
 5908065
(1 row)

This is the EXPLAIN output for the query above:

                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using kpi_fecha_llamada on cdr_ama_stat  (cost=115713.94..79528582.40 rows=140809 width=50)
   Index Cond: ((fecha_llamada >= '2009-04-18 00:00:00'::timestamp without time zone) AND (fecha_llamada <= '2009-04-19 00:00:00'::timestamp without time zone))
   Filter: (NOT (subplan))
   SubPlan
     ->  Materialize  (cost=115713.94..116202.56 rows=28162 width=34)
           ->  GroupAggregate  (cost=110902.49..115478.78 rows=28162 width=34)
                 ->  Sort  (cost=110902.49..111606.53 rows=281618 width=34)
                       Sort Key: dt.abonado_a, dt.abonado_b, dt.fecha_llamada, dt.duracion
                       ->  Bitmap Heap Scan on cdr_ama_stat dt  (cost=8580.03..70970.30 rows=281618 width=34)
                             Recheck Cond: ((fecha_llamada >= '2009-04-18 00:00:00'::timestamp without time zone) AND (fecha_llamada <= '2009-04-19 00:00:00'::timestamp without time zone))
                             ->  Bitmap Index Scan on kpi_fecha_llamada  (cost=0.00..8509.62 rows=281618 width=0)
                                   Index Cond: ((fecha_llamada >= '2009-04-18 00:00:00'::timestamp without time zone) AND (fecha_llamada <= '2009-04-19 00:00:00'::timestamp without time zone))
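The `Materialize` node under `Filter: (NOT (subplan))` is the slow part: the ~28,162-row subquery result is rescanned once per outer row, roughly 140,809 × 28,162 comparisons. Postgres can instead hash the subquery result, but only when it fits in work_mem; with the old default of 1MB it often falls back to the plan above. Raising work_mem for the session sometimes flips the plan (the value below is illustrative, size it to the machine):

```
-- session-local; with enough memory the filter should become
-- "Filter: (NOT (hashed subplan))" instead of a Materialize rescan
SET work_mem = '64MB';
-- then re-run EXPLAIN on the same query to confirm the plan changed
```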

Am I doing something wrong?

I think several minutes should be OK, but not several hours as happens now. I do a bulk load (approx. 350k rows) into that table every day, and yes, I did a VACUUM FULL ANALYZE on the database before posting these results.

Server is a dual Xeon 3.0 GHz with 2 GB RAM.

best regards


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general