Adding newlines in a SQL mysqldump to split extended inserts
The official mysqldump supports more or less two output styles: separate INSERTs (one insert statement per row) or extended INSERTs (one insert per table). Extended INSERTs are much faster, but MySQL write them all in one line, the result being a SQL very hard to read. Can we get the best of both worlds ?
Separate INSERTs
INSERT INTO mytable (id) VALUES (1);
INSERT INTO mytable (id) VALUES (2);Extended INSERTs
INSERT INTO mytable (id) VALUES (1),(2);
New-And-Improved™ INSERTs
INSERT INTO mytable (id) VALUES
(1),
(2);Current solutions
Using sed
mysqldump --extended-insert | sed 's/),(/),\n(/g'Only problem is, lines will be split, even in the middle of strings, altering your data.
Using net_buffer_length
mysqldump --extended-insert --net_buffer_length=5000mysqldump will make sure lines are not longer than 5000 (or whatever), starting a new INSERT when needed. The problem is that the behaviour is kinda random, diffs are hard to analyze and it may break your data if you are storing columns longer than this.
Writing a parser
This question has been often asked without a proper reply, so I decided to write a simple parser. Precisely, we need to check for quotes, parenthesis, and escape characters.
I first wrote it in PHP:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
<?php
# process-mysqldump.php
// Usage: cat dump.sql | php process-mysqldump.php
$input = fopen('php://stdin', 'r');
while(!feof($input)) {
$line = fgets($input);
if (substr($line, 0, 6) == 'INSERT') {
process_line($line);
} else {
echo $line;
}
}
function process_line($line) {
$length = strlen($line);
$pos = strpos($line, ' VALUES ') + 8;
echo substr($line, 0, $pos);
$parenthesis = false;
$quote = false;
$escape = false;
for ($i = $pos; $i < $length; $i++) {
switch($line[$i]) {
case '(':
if (!$quote) {
if ($parenthesis) {
throw new Exception('double open parenthesis');
} else {
echo PHP_EOL;
$parenthesis = true;
}
}
$escape = false;
break;
case ')':
if (!$quote) {
if ($parenthesis) {
$parenthesis = false;
} else {
throw new Exception('closing parenthesis without open');
}
}
$escape = false;
break;
case '\':
$escape = !$escape;
break;
case "'":
if ($escape) {
$escape = false;
} else {
$quote = !$quote;
}
break;
default:
$escape = false;
break;
}
echo $line[$i];
}
}
fclose($input);
But then I realized it was too slow, so I rewrote it in C, using strcspn to find string occurence:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
// process-mysqldump.c
// gcc -O2 -Wall -pedantic process-mysqldump.c -o process-mysqldump
// Usage: cat dump.sql | process-mysqldump
// Or : process-mysqldump dump.sql
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#define BUFFER 100000
bool is_escaped(char* string, int offset) {
if (offset == 0) {
return false;
} else if (string[offset - 1] == '\') {
return !is_escaped(string, offset - 1);
} else {
return false;
}
}
bool is_commented(char* string) {
char buffer[4];
sprintf(buffer, "%.3s", string);
return strcmp(buffer, "-- ") == 0;
}
int main(int argc, char *argv[])
{
FILE* file = argc > 1 ? fopen(argv[1], "r") : stdin;
char buffer[BUFFER];
char* line;
int pos;
int parenthesis = 0;
bool quote = false;
bool escape = false;
bool comment = false;
while (fgets(buffer, BUFFER, file) != NULL) {
line = buffer;
// skip commented
if (comment || is_commented(line)) {
comment = line[strlen(line) - 1] != '\n';
fputs(line, stdout);
} else {
pos = 0;
nullchar:
while (line[pos] != '\0') {
// if we are still in escape state, we need to check first char.
if (!escape) {
// find any character in ()'
pos = strcspn(line, "()'\\");
}
if (pos > 0) {
// print before match
printf("%.*s", pos, line);
}
switch (line[pos]) {
case '(':
if (!quote) {
if (parenthesis == 0) {
putchar('\n');
}
parenthesis++;
}
if (escape) {
escape = false;
}
break;
case ')':
if (!quote) {
if (parenthesis > 0) {
parenthesis--;
} else {
// whoops
puts("\n");
fputs(line, stdout);
fputs("Found closing parenthesis without opening one.\n", stderr);
exit(1);
}
}
if (escape) {
escape = false;
}
break;
case '\\':
escape = !escape;
break;
case '\'':
if (escape) {
escape = false;
} else {
quote = !quote;
}
break;
case '\0':
goto nullchar;
default:
if (escape) {
escape = false;
}
break;
}
// print char then skip it (to make sure we don’t double match)
putchar(line[pos]);
line = line + pos + 1;
pos = 0;
}
}
}
return 0;
}
The only flaw that I can think of is that the parser will fail if the 10001st character of a line is an escaped quote, it will see it as an unescaped quote.
Happy dumping !