Khun Yee Fung reports:
Finished the country-by-country reference cleanup. This is the third attempt to parse the references.
The first time, it was done in Java. Even when I used the ORO class library, which attempts to provide Perl-like regular expressions in Java, it took too much effort to get the parser to be versatile and nimble.
The second time, it was done with Perl, attempting to recognize all the comments from all the countries at the same time. Well, there are so many variations in how references are made that it was very difficult to work out the context of each reference. And the number of regular expressions was simply too large to maintain and modify.
Finally, this third time, I wrote one Perl script for each country. With regular expressions specific to each country, it is much easier to work out the context of the references. And I got to use the site itself to check the original comments and references. The first time I ate my own dog food. I hope the references are at least 80% accurate. Will take a while to make sure they are accurate.
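As a rough illustration of the per-country idea, here is a minimal sketch in Python rather than the Perl actually used. The country codes, the reference formats, and the names `COUNTRY_PATTERNS` and `find_references` are all assumptions made up for the example; the real scripts and patterns are not shown here.

```python
import re

# Hypothetical per-country patterns. Each country only carries the
# expressions needed for the way its own comments cite references.
COUNTRY_PATTERNS = {
    "CA": [re.compile(r"\b(?:clause|section)\s+(\d+(?:\.\d+)*)", re.IGNORECASE)],
    "DE": [re.compile(r"§\s*(\d+(?:\.\d+)*)")],
}


def find_references(country, comment):
    """Return the reference numbers found in one comment,
    using only the patterns registered for that country."""
    refs = []
    for pattern in COUNTRY_PATTERNS.get(country, []):
        refs.extend(pattern.findall(comment))
    return refs
```

Keeping the pattern list short for each country is the point: a pattern that would be ambiguous across all countries can be safe inside one country's comments.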
Combined all the comments into a page (it is about 1.5M of HTML), sorted according to the references I can parse, then the country code, then the comment number for the country. This is the first step of getting this site to be useful.
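The three-level sort can be sketched like this, again in Python and under assumptions: the record layout (`ref`, `country`, `number`), the dotted-number reference format, and the choice to put comments with no parsable reference at the end are all hypothetical.

```python
def sort_key(comment):
    """Order by parsed reference, then country code, then comment number."""
    ref = comment["ref"]
    if ref:
        # Compare dotted references numerically, so 5.2 sorts before 5.10.
        ref_key = (0, tuple(int(part) for part in ref.split(".")))
    else:
        # Assumption: comments without a parsable reference go last.
        ref_key = (1, ())
    return (ref_key, comment["country"], comment["number"])


comments = [
    {"ref": "5.10", "country": "CA", "number": 1},
    {"ref": "5.2", "country": "US", "number": 3},
    {"ref": None, "country": "CA", "number": 2},
    {"ref": "5.2", "country": "CA", "number": 7},
    {"ref": "5.2", "country": "CA", "number": 4},
]
ordered = sorted(comments, key=sort_key)
```

Sorting the reference as a tuple of integers avoids the usual string-sort surprise where "5.10" lands before "5.2".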
Will start parsing the paragraph references next week. Should be a lot of fun.
September 20, 2007