Bacula-users

[Bacula-users] Migrating form mysql to postgresql: Duplicate filename entries?

2011-06-27 09:06:22
Subject: [Bacula-users] Migrating form mysql to postgresql: Duplicate filename entries?
From: Gavin McCullagh <gavin.mccullagh AT gcd DOT ie>
To: bacula-users AT lists.sourceforge DOT net
Date: Mon, 27 Jun 2011 14:03:24 +0100
Hi,

I've been experimenting with a migration from MySQL to Postgres.

One problem I've come across is that there are a handful of duplicate files
in the Filename table 

mysql> select count(*) as filecount, Filename.name from Filename  GROUP BY 
Filename.Name ORDER BY filecount DESC LIMIT 30;
+-----------+-----------------------------------+
| filecount | name                              |
+-----------+-----------------------------------+
|         2 | 0                                 |
|         2 | 1                                 |
|         2 | 2                                 |
|         2 | tt172.16.17.36-www_gmail_com.html |

mysql> select * from Filename WHERE name IN 
('0','1','2','tt172.16.17.36-www_gmail_com.html');
+------------+-----------------------------------+
| FilenameId | Name                              |
+------------+-----------------------------------+
|       9247 | 0                                 |
|     101380 | 0                                 |
|       9248 | 1                                 |
|     101381 | 1                                 |
|       9249 | 2                                 |
|     101382 | 2                                 |
|     575752 | tt172.16.17.36-www_gmail_com.html |
|     625369 | tt172.16.17.36-www_gmail_com.html |
+------------+-----------------------------------+

and there doesn't appear to be any constraint preventing this.

mysql> describe Filename;
+------------+------------------+------+-----+---------+----------------+
| Field      | Type             | Null | Key | Default | Extra          |
+------------+------------------+------+-----+---------+----------------+
| FilenameId | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| Name       | blob             | NO   | MUL | NULL    |                |
+------------+------------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)

mysql> show create table Filename;
+----------+-----------------------------------------
| Filename | CREATE TABLE `Filename` (
  `FilenameId` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `Name` blob NOT NULL,
  PRIMARY KEY (`FilenameId`),
  KEY `Name` (`Name`(255))
) ENGINE=MyISAM AUTO_INCREMENT=4277244 DEFAULT CHARSET=latin1 |
+----------+-----------------------------------------

However, the postgresql table does have this constraint and complains when
I try to insert the data.

bacula=# COPY Filename FROM '/var/tmp/bacula-backup/Filename.txt';
ERROR:  duplicate key value violates unique constraint "filename_name_idx"
CONTEXT:  COPY filename, line 101380: "101380   0"

I could of course prune four duplicate lines from the data before
inserting, but I'm afraid of the possible effect on a future restore.

It appears there are duplicate entries in the File database for each time there
is a duplicate in the Filename table:

mysql> SELECT * FROM File WHERE FilenameId IN (select FilenameId from Filename 
WHERE name IN ('0','1','2','tt172.16.17.36-www_gmail_com.html')) ;
+-----------+-----------+-------+---------+------------+--------+-----------------------------------------------------------+------------------------+
| FileId    | FileIndex | JobId | PathId  | FilenameId | MarkId | LStat         
                                            | MD5                    |
+-----------+-----------+-------+---------+------------+--------+-----------------------------------------------------------+------------------------+
| 245079464 |    207178 | 13471 |    3870 |       9249 |      0 | gB iwKO IGk B 
Po Po A Hg BAA I BNTy22 BKRzfV BKR4jB A A E | R13Gb8BoM3TpapoeZW0FVQ |
| 245079465 |    207178 | 13471 |    3870 |     101382 |      0 | gB iwKO IGk B 
Po Po A Hg BAA I BNTy22 BKRzfV BKR4jB A A E | R13Gb8BoM3TpapoeZW0FVQ |
| 215364675 |    173643 | 11277 |    3870 |       9249 |      0 | gB iwKO IGk B 
Po Po A Hg BAA I BM/DvL BKRzfV BKR4jB A A E | R13Gb8BoM3TpapoeZW0FVQ |
| 215364676 |    173643 | 11277 |    3870 |     101382 |      0 | gB iwKO IGk B 
Po Po A Hg BAA I BM/DvL BKRzfV BKR4jB A A E | R13Gb8BoM3TpapoeZW0FVQ |
| 259000887 |    290784 | 14315 |  745797 |       9248 |      0 | A A IH/ B A A 
A A A A BBg44b 22f3Y BBg44b A A M           | UumkJoY8ZKJd6/HfnwvCDg |
| 259000888 |    290784 | 14315 |  745797 |     101381 |      0 | A A IH/ B A A 
A A A A BBg44b 22f3Y BBg44b A A M           | UumkJoY8ZKJd6/HfnwvCDg |
| 258833500 |    123399 | 14315 |  807939 |       9248 |      0 | A A IH/ B A A 
A A A A BBamBf 22f3Y BBamBf A A M           | G4vlZvkmC8DFxl8y0RsnfA |
| 258833501 |    123399 | 14315 |  807939 |     101381 |      0 | A A IH/ B A A 
A A A A BBamBf 22f3Y BBamBf A A M           | G4vlZvkmC8DFxl8y0RsnfA |
| 244900411 |     28129 | 13471 |    1752 |       9249 |      0 | gB GgUm IGk B 
Po Po A DH BAA I BNTyw4 BJlpIE BJlpIE A A E | ohnd0rdHuoOlhnsfdNGaAw |
| 244900412 |     28129 | 13471 |    1752 |     101382 |      0 | gB GgUm IGk B 
Po Po A DH BAA I BNTyw4 BJlpIE BJlpIE A A E | ohnd0rdHuoOlhnsfdNGaAw |
| 224916087 |     41066 | 12104 | 1079177 |     101381 |      0 | gI wox IGk B 
A A A fO BAA I BNR25S BL3WHV BNRzS7 A A E    | hjBeI+HIpmuRUL7FK42mFA |
| 224916086 |     41066 | 12104 | 1079177 |       9248 |      0 | gI wox IGk B 
A A A fO BAA I BNR25S BL3WHV BNRzS7 A A E    | hjBeI+HIpmuRUL7FK42mFA |
| 224916081 |     41061 | 12104 | 1079177 |     101382 |      0 | gI wos IGk B 
A A A 90 BAA I BNR25S BL3WHV BNRzS7 A A E    | RUHfzzaQ4uJdbkSROzHPjQ |
| 224916080 |     41061 | 12104 | 1079177 |       9249 |      0 | gI wos IGk B 
A A A 90 BAA I BNR25S BL3WHV BNRzS7 A A E    | RUHfzzaQ4uJdbkSROzHPjQ |
| 224916075 |     41056 | 12104 | 1079176 |     101381 |      0 | gI won IGk B 
A A A IS BAA I BNR25S BL3WHT BNRzS7 A A E    | SbEF5jwAkpQM0EeUwM2s9A |


so perhaps this is safe enough.  Does anyone know how these duplicates may
have arisen and what the best way to proceed is?

Gavin


------------------------------------------------------------------------------
All of the data generated in your IT infrastructure is seriously valuable.
Why? It contains a definitive record of application performance, security 
threats, fraudulent activity, and more. Splunk takes this data and makes 
sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-d2d-c2
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users

<Prev in Thread] Current Thread [Next in Thread>