Adding newlines in a SQL mysqldump to split extended inserts

4 minute read

The official mysqldump supports more or less two output styles: separate INSERTs (one insert statement per row) or extended INSERTs (one insert per table). Extended INSERTs are much faster, but MySQL write them all in one line, the result being a SQL very hard to read. Can we get the best of both worlds ?

Separate INSERTs

INSERT INTO mytable (id) VALUES (1);
INSERT INTO mytable (id) VALUES (2);

Extended INSERTs

INSERT INTO mytable (id) VALUES (1),(2);

New-And-Improved™ INSERTs

INSERT INTO mytable (id) VALUES
(1),
(2);

Current solutions

Using sed

mysqldump --extended-insert | sed 's/),(/),\n(/g'

Only problem is, lines will be split, even in the middle of strings, altering your data.

Using net_buffer_length

mysqldump --extended-insert --net_buffer_length=5000

mysqldump will make sure lines are not longer than 5000 (or whatever), starting a new INSERT when needed. The problem is that the behaviour is kinda random, diffs are hard to analyze and it may break your data if you are storing columns longer than this.

Writing a parser

This question has been often asked without a proper reply, so I decided to write a simple parser. Precisely, we need to check for quotes, parenthesis, and escape characters.

I first wrote it in PHP:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
<?php
# process-mysqldump.php

// Usage: cat dump.sql | php process-mysqldump.php

$input = fopen('php://stdin', 'r');

while(!feof($input)) {
    $line = fgets($input);
    if (substr($line, 0, 6) == 'INSERT') {
        process_line($line);
    } else {
        echo $line;
    }
}

function process_line($line) {
    $length = strlen($line);

    $pos = strpos($line, ' VALUES ') + 8;
    echo substr($line, 0, $pos);

    $parenthesis = false;
    $quote = false;
    $escape = false;

    for ($i = $pos; $i < $length; $i++) {
        switch($line[$i]) {
            case '(':
                if (!$quote) {
                    if ($parenthesis) {
                        throw new Exception('double open parenthesis');
                    } else {
                        echo PHP_EOL;
                        $parenthesis = true;
                    }
                }
                $escape = false;
                break;

            case ')':
                if (!$quote) {
                    if ($parenthesis) {
                        $parenthesis = false;
                    } else {
                        throw new Exception('closing parenthesis without open');
                    }
                }
                $escape = false;
                break;

            case '\':
                $escape = !$escape;
                break;

            case "'":
                if ($escape) {
                    $escape = false;
                } else {
                    $quote = !$quote;
                }
                break;

            default:
                $escape = false;
                break;
        }

        echo $line[$i];
    }
}

fclose($input);

View Gist

But then I realized it was too slow, so I rewrote it in C, using strcspn to find string occurence:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
// process-mysqldump.c
// gcc -O2 -Wall -pedantic process-mysqldump.c -o process-mysqldump
// Usage: cat dump.sql | process-mysqldump
//   Or : process-mysqldump dump.sql

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

#define BUFFER 100000

bool is_escaped(char* string, int offset) {
    if (offset == 0) {
        return false;
    } else if (string[offset - 1] == '\') {
        return !is_escaped(string, offset - 1);
    } else {
        return false;
    }
}

bool is_commented(char* string) {
    char buffer[4];

    sprintf(buffer, "%.3s", string);

    return strcmp(buffer, "-- ") == 0;
}

int main(int argc, char *argv[])
{
    FILE* file = argc > 1 ? fopen(argv[1], "r") : stdin;

    char buffer[BUFFER];
    char* line;
    int pos;
    int parenthesis = 0;
    bool quote = false;
    bool escape = false;
    bool comment = false;

    while (fgets(buffer, BUFFER, file) != NULL) {
        line = buffer;

        // skip commented
        if (comment || is_commented(line)) {
            comment = line[strlen(line) - 1] != '\n';
            fputs(line, stdout);
        } else {
            pos = 0;

            nullchar:
            while (line[pos] != '\0') {
                // if we are still in escape state, we need to check first char.
                if (!escape) {
                    // find any character in ()'
                    pos = strcspn(line, "()'\\");
                }

                if (pos > 0) {
                    // print before match
                    printf("%.*s", pos, line);
                }

                switch (line[pos]) {
                    case '(':
                        if (!quote) {
                            if (parenthesis == 0) {
                                putchar('\n');
                            }
                            parenthesis++;
                        }
                        if (escape) {
                            escape = false;
                        }
                        break;

                    case ')':
                        if (!quote) {
                            if (parenthesis > 0) {
                                parenthesis--;
                            } else {
                                // whoops
                                puts("\n");
                                fputs(line, stdout);
                                fputs("Found closing parenthesis without opening one.\n", stderr);
                                exit(1);
                            }
                        }
                        if (escape) {
                            escape = false;
                        }
                        break;

                    case '\\':
                        escape = !escape;
                        break;

                    case '\'':
                        if (escape) {
                            escape = false;
                        } else {
                            quote = !quote;
                        }
                        break;

                    case '\0':
                        goto nullchar;

                    default:
                        if (escape) {
                            escape = false;
                        }
                        break;
                }

                // print char then skip it (to make sure we don’t double match)
                putchar(line[pos]);
                line = line + pos + 1;
                pos = 0;
            }
        }
    }

    return 0;
}

View Gist

The only flaw that I can think of is that the parser will fail if the 10001st character of a line is an escaped quote, it will see it as an unescaped quote.

Happy dumping !

Share on

X Facebook LinkedIn Bluesky

Sebastien Lavoie

Adding newlines in a SQL mysqldump to split extended inserts

Separate INSERTs

Extended INSERTs

New-And-Improved™ INSERTs

Current solutions

Using sed

Using net_buffer_length

Writing a parser

Share on

You May Also Enjoy

Override Doctrine ODM mapping for Symfony FOSUserBundle

Preventing cache stampede when using Doctrine Cache

Load Doctrine Fixtures when using Migrations in Symfony

MySQL 5.7 and Wordpress problem