So I was recently helping a client with an issue in MySQL where a migration failed to transfer the full contents of some fields. The affected rows amounted to a little over 1% of the total messages transferred. In doing some research, we discovered that the one thing every affected message had in common was the presence of multi-byte (high Unicode) characters. In many cases, these came from a user pasting text from Microsoft Word.
Remember, there is a correct way to dump and restore data from MySQL, and it doesn’t involve piping output. That was the root cause of this problem.
So, how to clean up the mess? Fortunately, we still had the old database up, so we needed to figure out which messages might be suspicious and in need of repair. That means finding fields that contain multi-byte Unicode characters.
There are several articles on the Internet about finding such rows, but they are mostly wrong. They all tell you to do something like this (with messages and body as placeholder table and column names):
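-- Flags rows whose byte length differs from their character length
SELECT id, body
FROM messages
WHERE LENGTH(body) != CHAR_LENGTH(body);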
THIS IS WRONG. DO NOT DO THIS. User eggyal in this Stack Overflow thread explains why:
Suppose, for example, that one has a UTF-16 column containing ‘ā’ (encoded by the byte sequence 0x0101) - it would be deemed “ASCII” using this test: a false negative; indeed, some character sets do not encode ASCII characters within 0x00 to 0x7f whereupon this solution would yield a false positive.
Many character sets are fixed-length and so LENGTH(column) will be a constant multiple of CHAR_LENGTH(column) irrespective of the value.
This is the correct way (again with placeholder table and column names):
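-- Unconvertable characters become replacement characters,
-- so any row that changes under the conversion contains non-ASCII data
SELECT id, body
FROM messages
WHERE CONVERT(body USING ascii) <> body;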
Ollie Jones explains why:
The CONVERT(col USING charset) function will turn the unconvertable characters into replacement characters. Then, the converted and unconverted text will be unequal.
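You can see the replacement behavior directly in a MySQL client:

SELECT CONVERT('ā' USING ascii);        -- returns '?'
SELECT 'ā' = CONVERT('ā' USING ascii);  -- returns 0: the texts are unequal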
Armed with this, I was able to write a program to query all the suspicious messages and compare them with the data in the old database. Where they were not the same, we were able to restore the old data.
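For the curious, here is a minimal sketch of that repair in plain SQL, assuming the old and new databases live on the same MySQL server as olddb and newdb (hypothetical names), each with a messages table keyed by id:

-- Find suspicious rows: the old copy contains non-ASCII characters
-- and the migrated copy no longer matches it
SELECT old.id
FROM olddb.messages AS old
JOIN newdb.messages AS new ON new.id = old.id
WHERE old.body <> CONVERT(old.body USING ascii)
  AND new.body <> old.body;

-- Restore the old data for exactly those rows
UPDATE newdb.messages AS new
JOIN olddb.messages AS old ON old.id = new.id
SET new.body = old.body
WHERE old.body <> CONVERT(old.body USING ascii)
  AND new.body <> old.body;

Our actual repair was a program comparing the two databases rather than a cross-database query, but the filtering logic was the same.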