I’ve been told I will have a very large file to import into MySQL, 160 million rows. I don’t have the file yet but I wanted to see how my queries would work with such a database so I generated some content. It’s a database of SHA1 hashes which I will store as strings of length 32 in base 32 (26 letter alphabet and digits 2-7). In base 16 (0-9 and A-F), they would be 40 characters long.
First task is to generate the data. I first wrote some SQL queries to generate random strings (stored procedure, RAND(), array of valid characters). Based on some smaller scale tests, I estimate this would take 80 hours on an InnoDB table. Much too long, so I tried taking the SHA1() of a random number. This was estimated to take 47 hours on an InnoDB table but only 75 minutes on a MyISAM table. Apparently, this is due to InnoDB keeping a log of everything so it can roll back the transaction if something goes wrong. Still too long so another option I tried was to generate the entries with a short Python script using hashlib.sha1(str(i)).hexdigest() where i is an int between 0 and 159,999,999. This only took about 9 minutes and the total file size was 6.5 GB.
Second task is to get the data into the database. InnoDB is too slow. MyISAM took about 10 minutes to import the text file (one value per line) using the load data infile local command. This is conveniently handled via the GUI in HeidiSQL. The total database size was 22 GB, but I think I made the char field a little too big (40 characters long instead of 32).
Searching for a single value in this table took close to 4 minutes, so I added an index on the only column. The index took several hours to generate and added 7 GB to the table size. The basic select query now takes less than 0.1 seconds.
The next day, I wanted to repeat this and see if increasing the value of the myisam_sort_buffer_size variable would speed up the index creation. Unfortunately, I created an InnoDB table (by default) and didn’t change the engine to MyISAM before starting the bulk import. Several hours later, I killed HeidiSQL, then loaded the MySQL command line client, ran show full processlist and kill 2 where 2 was the process id of my import query. This didn’t stop the constant hard drive access because the InnoDB engine then started to roll back the import. Apparently, this can take even longer than the import itself and is impossible to stop even if you stop and start mysqld or the whole server. The rollback operation always continues automatically when you restart the server. This isn’t quite true and I came across a page in the documentation which explains that you can set innodb_force_recovery to the value 3 in the configuration file to prevent rollback. Then you restart the server, go to the database and drop the problematic table, delete the innodb_force_recovery setting or put it back to zero, then restart the server again. There were a few issues with the server not stopping correctly and I had to reboot a few times but it did work and I was able to access the database again.
Sorry for the lack of references.