You are currently browsing the archives for the Programming category.
Still working on importing my large (> 6 GB, 160 million rows) data file into MySQL. The only way to do this in a reasonable time seems to be to use LOAD DATA LOCAL INFILE. This works on the command line but it’s not straightforward to use in my webapp (the size of the file might cause other aspects, such as the upload, to fail but that’s another problem). I don’t have access to the filename because I don’t know where it’s been uploaded to. The Primefaces upload widget only gives me an InputStream, not the path to the uploaded file.
The MySQL JDBC driver, Connector/J, provides a special feature described at the bottom of the implementation notes: setLocalInfileInputStream() which will allow me to use LOAD DATA LOCAL INFILE but read from a stream instead of a file. You still need to include a filename in the command but it is ignored.
A problem appears when trying to access this method. It’s part of com.mysql.jdbc.Statement, not java.sql.Statement. The MySQL version implements the java.sql interface, and in principle, I should be using the interface rather than the specific implementation. However, I really need this particular feature of Connector/J in order to load my large file. In Glassfish, my datasource has been set up to provide connections to the database using the Connector/J library (i.e., the MySQL driver), but the Connection and Statement objects it provides are java.sql.Connection and java.sql.Statement respectively. Casting them to com.mysql.jdbc.... doesn’t work. I tried casting the DataSource to com.mysql.jdbc.jdbc2.optional.MysqlDataSource but that throws a ClassCastException too.
There is a method called unwrap() in the Connection and Statement classes. These give access to the underlying implementation. See this page about connection wrapping in the Oracle docs. The snippet which works for me is given below. DataSource, Connection, and PreparedStatement are all imported from java.sql.
DataSource ds = (DataSource) ctx.lookup("jdbc/my-db-connection");
try (Connection conn = ds.getConnection()) {
PreparedStatement statement = conn.prepareStatement("LOAD DATA LOCAL INFILE 'bla' IGNORE INTO TABLE knownhashvalues (hashValue)");
// Unwrap the native mysql statement if possible
if (statement.isWrapperFor(com.mysql.jdbc.Statement.class)) {
com.mysql.jdbc.PreparedStatement mysqlStatement = statement.unwrap(com.mysql.jdbc.PreparedStatement.class);
mysqlStatement.setLocalInfileInputStream(is);
mysqlStatement.execute();
} else {
throw (new SQLException("Unable to unwrap database connection to use MySQL specific commands"));
}
}
In practice, to speed things up, I disable indexing before loading the data and I re-enable it afterwards. Because re-enabling indexes can take a long time, I temporarily change the myisam_sort_buffer_size. So my full code looks like this.
DataSource ds = (DataSource) ctx.lookup("jdbc/my-db-connection");
try (Connection conn = ds.getConnection()) {
PreparedStatement statement = conn.prepareStatement("LOAD DATA LOCAL INFILE 'bla' IGNORE INTO TABLE knownhashvalues (hashValue)");
// Unwrap the mysql native statement if possible
if (statement.isWrapperFor(com.mysql.jdbc.Statement.class)) {
com.mysql.jdbc.PreparedStatement mysqlStatement = statement.unwrap(com.mysql.jdbc.PreparedStatement.class);
Statement st = conn.createStatement();
st.executeUpdate("ALTER TABLE knownhashvalues DISABLE KEYS");
mysqlStatement.setLocalInfileInputStream(is);
mysqlStatement.execute();
st.executeUpdate("SET myisam_sort_buffer_size = 1073741824");
st.executeUpdate("ALTER TABLE knownhashvalues ENABLE KEYS"); // Could be slow.
st.executeUpdate("SET myisam_sort_buffer_size = 26214400");
} else {
throw (new SQLException("Unable to unwrap database connection to use MySQL specific commands"));
}
}
I’ve been told I will have a very large file to import into MySQL, 160 million rows. I don’t have the file yet but I wanted to see how my queries would work with such a database so I generated some content. It’s a database of SHA1 hashes which I will store as strings of length 32 in base 32 (26 letter alphabet and digits 2-7). In base 16 (0-9 and A-F), they would be 40 characters long.
First task is to generate the data. I first wrote some SQL queries to generate random strings (stored procedure, RAND(), array of valid characters). Based on some smaller scale tests, I estimate this would take 80 hours on an InnoDB table. Much too long, so I tried taking the SHA1() of a random number. This was estimated to take 47 hours on an InnoDB table but only 75 minutes on a MyISAM table. Apparently, this is due to InnoDB keeping a log of everything so it can roll back the transaction if something goes wrong. Still too long so another option I tried was to generate the entries with a short Python script using hashlib.sha1(str(i)).hexdigest() where i is an int between 0 and 159,999,999. This only took about 9 minutes and the total file size was 6.5 GB.
Second task is to get the data into the database. InnoDB is too slow. MyISAM took about 10 minutes to import the text file (one value per line) using the load data infile local command. This is conveniently handled via the GUI in HeidiSQL. The total database size was 22 GB, but I think I made the char field a little too big (40 characters long instead of 32).
Searching for a single value in this table took close to 4 minutes, so I added an index on the only column. The index took several hours to generate and added 7 GB to the table size. The basic select query now takes less than 0.1 seconds.
The next day, I wanted to repeat this and see if increasing the value of the myisam_sort_buffer_size variable would speed up the index creation. Unfortunately, I created an InnoDB table (by default) and didn’t change the engine to MyISAM before starting the bulk import. Several hours later, I killed HeidiSQL, then loaded the MySQL command line client, ran show full processlist and kill 2 where 2 was the process id of my import query. This didn’t stop the constant hard drive access because the InnoDB engine then started to roll back the import. Apparently, this can take even longer than the import itself and is impossible to stop even if you stop and start mysqld or the whole server. The rollback operation always continues automatically when you restart the server. This isn’t quite true and I came across a page in the documentation which explains that you can set innodb_force_recovery to the value 3 in the configuration file to prevent rollback. Then you restart the server, go to the database and drop the problematic table, delete the innodb_force_recovery setting or put it back to zero, then restart the server again. There were a few issues with the server not stopping correctly and I had to reboot a few times but it did work and I was able to access the database again.
Sorry for the lack of references.
The MySQL data files and logs (slow log, global log, error log…) are under C:\ProgramData\MySQL\MySQL Server 5.5\data. The ProgramData directory is hidden so you need to change the folder settings. There is a data folder in C:\Program Files too but it doesn’t seem to be active.
When you use Subversion (or other version control systems), be sure to use them consistently. Checked out directories contain a hidden directory containing information about the repository, the original checked out files, which files have been deleted or added… Copying the complete directory to a different location or even a different computer might work seamlessly, copying individual files is also acceptable. But as soon as you copy subdirectories (with their special hidden directory) that have been updated or committed on another computer, the upper directory and the subdirectory become inconsistent as far as version control is concerned. You will get messages asking you to clean up, and claiming that some subdirectories are not under version control.
- Best practice is to use the version control tools to exchange files: commit on one computer, checkout on another, make changes, commit, update on the first computer.
- If individual files need to be processed on another computer (eg. batch converting images from one format to another on a *nix machine) then proceed as above, or select all and copy the individual files, not the whole directory.
- If you need to provide the whole version controlled directory to someone who doesn’t need version control (eg. sending LaTeX sources to an editor), use the export feature to create a clean directory tree without all the hidden version control subdirectories.
Trying to get some very basic marker detection working for an indoor localisation project. After days of searching the web I still had to call on Jayson to help me out. After so many failures, it was a surprise and a relief that the solution below just worked. It’s a shame most of these AR projects seem to be out of date or undocumented.
Compiling ARToolKitPlus
Get ARToolKitPlus from Launchpad. As I write this in May 2011, the latest version is 2.2.1 released on 2011-02-05. Download the tar.bz2 archive.
tar xvjf ARToolKitPlus-2.2.1.tar.bz2
cd ARToolKitPlus-2.2.1
mkdir build
cd build
cmake ..
make
sudo make install
cmake-gui is a useful alternative to cmake if you need to change the configuration.
Sample code
There is a very simple example in sample/simple. Compile it with g++ main.cpp -o main -lARToolKitPlus. Run it from the sample directory so it can find its sample image.
Apparently the simple tracking locates markers individually, but can still locate multiple markers in each frame, and the multi tracking locates several markers that are assumed to be on the same plane and computes the equation of that plane.
Camera calibration
Calibrating the camera (to compensate for distortion) is important if you want to estimate accurate positions of the markers relative to the camera.
For camera calibration, follow the instructions from the bottom of the Studierstube website. Here are some instructions on how to use the toolbox.
- Download the Matlab calibration toolbox by Jean-Yves Bouguet.
- Take about 10 pictures of a chequerboard with the camera you wish to calibrate. Store them in the
toolbox_calib directory along with the Matlab code you just downloaded. File names must end with an incrementing number (before the file format extension). Format can be ras, bmp, tif, pgm, jpg or ppm.
- Run the
calib script.
- Choose the standard method.
- Image names or Read images seem to do the same thing. If your files are called 0.bmp, 1.bmp, 2.bmp… then hit return for the basename and enter ‘b’ for the format.
- Extract grid corners.
- Hit return without entering anything to process all images.
- Depending on the pixel size of your screen, the resolution of your camera and how accurately you can click, you may need to increase the wintx and winty parameters. These are half the size in pixels of the box within which the toolbox looks for grid corners. 20 was a good value for my 5×7 chequerboard with 40mm squares and a 640×480 camera.
- Hit return again to use the automatic square counting mechanism or enter your own values. Count the number of squares in each direction on your calibration board and subtract 2 from each (in other words, don’t count the outer squares).
- Click the four corners of the board, starting at the top left, and going clockwise.
- Enter the size of the squares on the board in millimetres. You only get asked this for the first image. If you get it wrong you need to close the toolbox and restart (maybe clear the workspace too).
- If your camera and lens distort the image a lot, the other corners in the image will not be detected properly because they aren’t spaced at predictable intervals. By entering a value between -1 and 1 for the initial distortion you can get the initial corner estimates to be closer to the real corners. I had no idea what a good value was so just tried a few until the red crosses were more or less in the right positions. Note that whether you enter an initial estimate for the distortion factor or not, the window you need to look at may be hidden behind another figure.
- The toolbox will ask you to select the four outer corners and then attempt to detect the others for all images in succession.
It shouldn’t matter too much if some of the corners aren’t detected properly. I expect they are ignored as outliers at the end.
- When you’ve detected the corners on all images, click Calibration.
- Visually check that the computed parameters make sense by clicking Undistort image. Enter 1, then the name (including number) of the image without the extension. The image should very clearly be undistorted, ie. the lines on the chequerboard should be parallel.
- Click Show calib results and use the numbers given to create a camera calibration file to use in ARToolKitPlus. The format is given on the Studierstube page:
[line1]: ARToolKitPlus_CamCal_Rev02
[line2]: xsize ysize cc_x cc_y fc_x fc_y kc1 kc2 kc3 kc3 kc5 kc6 iter
The OpenCV computer vision library can be linked against the Intel Performance Primitives to speed up certain operations. However, by default OpenCV version 2.2 will not find the libraries from IPP version 7.0.
First, install IPP. Download from Intel > Support > Resources > Developer Center > Software Developer Support > Tools and Downloads > Free – Non Commercial. Decide whether you accept the terms of the non-commercial license agreement. Click Intel® Integrated Performance Primitives (Intel® IPP) for Linux. Enter your e-mail address and download the IPP archive. You will receive a license key by e-mail. Enter this during the installation. The installer had some issues with the network (probably proxy-related) so I had to follow some additional instructions to paste a string generated by the installer into a special website and then paste the output back into the installer.
Get OpenCV from the official wiki. Extract the OpenCV source archive to a folder.
Edit CMakefile.txt in the root of the OpenCV folder. Replace the IPP section with the one proposed by Ying H on the Intel forum. This is in order to avoid missing libraries and errors such as /usr/bin/ld: cannot find -lippsmerged due to recent changes in the IPP folder structure. Make sure you have included the appropriate line for your system (eg. ia32/ or intel64/).
Create a subfolder called release or build in the OpenCV folder. Change to this new folder. Run cmake-gui ... Check the options you want to enable. Among other things, I checked BUILD_NEW_PYTHON_SUPPORT, ENABLE_SSE, ENABLE_SSE2, ENABLE_SSE41, ENABLE_SSSE3 (which I got by looking at the output of cat /proc/cpuinfo), USE_IPP, and I set IPP_PATH to /opt/intel/ipp/lib/ia32 which was the default IPP installation path on my machine. Click configure and then generate.
Close cmake-gui and run make from inside the release folder. Then sudo make install to install the libraries, include files and some executables into /usr/local/ or wherever you specified. You may need to run sudo ldconfig to make sure the OpenCV libraries are detected by your system.
Subversion repositories can be accessed via https using the dav_svn module in Apache. You can use either basic or digest authentication, digest having the advantage of not sending the credentials in clear text over the network. But if you want to give access to some users who already have local accounts on the server (with full SSH access), and also to some users who don’t need access to the full server but only to this particular repository, then you have to use basic authentication. I think this is due to the way the auth_pam module works (the one that allows Apache to authenticate local accounts). You can redirect all http requests to the https page, but then the credentials get sent twice, once for the http page, and once for the https page. Not very helpful.
The solution is to put the repository access and authentication part of the configuration inside a VirtualHost *:443 block, and put the redirection inside a VirtualHost *:80 block. The configuration below works for me but can no doubt be simplified a bit more.
# VirtualHost for all non-https requests.
<VirtualHost *:80>
# Redirect all access to svn repositories to the https address.
RewriteEngine On
RewriteRule ^/svn/?(.*) https://%{HTTP_HOST}%{REQUEST_URI}
# Do not allow any access to the subversion repository.
# Possibly not useful for us, but for an actual folder (rather than DAV) this could be a good idea.
<Location /svn>
deny from all
</Location>
</VirtualHost>
<VirtualHost *:443>
# These lines are included elsewhere in another VirtualHost *:443 section, but apparently they need to be added here too.
SSLEngine on
SSLCertificateFile /etc/apache/ssl/myssl.crt
SSLCertificateKeyFile /etc/apache/ssl/myprivkey.pem
SSLCertificateChainFile /etc/apache/ssl/mychain.crt
# It seems the location can be set to whatever you like. The DAV module will create a sort of alias between the SVNPath and the Location.
<Location /svn/mysecurerepo>
# I have no idea why this must be included, but if it's missing, you will get an error:
# svn: Repository moved temporarily to 'http://myserver/svn'; please relocate
RewriteEngine on
# An extra precaution.
# This line will prevent non-https access but access will be blocked before any redirection takes place if used outside the *:443 VirtualHost block.
SSLRequireSSL
# Point to subversion repository.
DAV svn
SVNPath /home/subversion/mysecurerepo
# Use PAM authentication for local users, but also check against users in the htpasswd and group files.
AuthBasicAuthoritative Off
AuthPAM_Enabled on
AuthPAM_FallThrough on
# Or use digest, but then PAM authentication won't work.
AuthType Basic
AuthName "My secure subversion repository"
AuthUserFile /etc/apache2/htpasswd/mysecure.htpasswd
# It seems the group file can contain a mix of PAM and htpasswd user names.
AuthGroupFile /etc/apache2/htpasswd/groups
Require group mysecuregroup
</Location>
# Include more repositories here.
</VirtualHost>
Some ideas from Dirk-Willem’s post at Stack Overflow.
Other tips from the Apache wiki’s common misconfigurations page.
A collaborator sent me an ASP webapp to demo some of his ideas. As a non-Windows user I struggled to get it running. But it’s not that difficult.
- Get a Windows computer (I’m using Windows XP Professional). You’ll need the installation CD that goes with it if the IIS webserver isn’t already installed.
- Install the IIS webserver. Go to the Control Panel, Add or Remove Programs, Add/Remove Windows Components. Check the IIS box, then click Next and follow the instructions. You’ll need to insert the installation CD.
- Extract the zip file I received to
C:\Inetpub\wwwroot. This should create a new folder which we’ll call MyWebApp. This folder has several pairs of .aspx and .aspx.cs files with matching names. The .aspx files seem to contain the web page HTML with a bit of ASP (similar to the way PHP pages work). The .aspx.cs files contain C# functions that are used by the .aspx pages. The App_Code subfolder contains some .cs with a few C# function definitions. The App_Data subfolder is empty. The Bin subfolder contains a .dll, specifically LinqToWikipedia.dll. I found I needed to replace this DLL with one downloaded from the official LinqToWikipedia website.
- Configure IIS. Control Panel, Administrative Tools, Internet Information Services, left column, Web Sites, Default Web Site, MyWebApp. Right
click, Properties. Check the Directory Browsing box. Click the Create button next to Application Name in the lower half of the page. The Application Name box should now display ‘MyWebApp’ and the button should change to read Remove. Validate changes with OK.
- Point a browser to http://localhost/MyWebApp. You should be able to view all the .aspx pages.
- Some pages need editing. Visual Studio not required but gives nice syntax highlighting. No compilation required. Just save the modified file and refresh the corresponding page in the browser.
Using urllib2 to retrieve some RDF data for SPARQLWrapper (semantic web project) but getting ‘HTTP Error 409: Conflict’. This is caused by the university proxy.
The solution is to edit /usr/share/python-support/python-sparqlwrapper/SPARQLWrapper/Wrapper.py. Add these three lines at the end of the __init__ function:
proxy = urllib2.ProxyHandler({"http" : 'http://wwwcache.lancs.ac.uk:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
Found the answer in a StackOverflow post by another Lancaster Uni person.
In Matlab you can easily break figure titles, xlabels and ylabels over several lines by using a cell array of strings.
title({'First line and' 'second line'});
But for legends this notation is used to assign legend entries to different plots.
plot(data1);
plot(data2);
legend({'Data 1' 'Data 2'});
You can achieve multiline legend entries by wrapping each entry with sprintf.
legend(sprintf('First line and\nsecond line'));
This works even if you have multiple legend entries and can also be applied to other commands such as title, xlabel and ylabel.
plot(rand(100,3));
legend({sprintf('First line and\nsecond line') 'Single line' sprintf('Another broken\nline')});
xlabel(sprintf('And a line break\nhere'));
title(sprintf('Line breaks\nin legends'));