OCRopus (tm) Packages for Ubuntu Hardy
This is a poor substitute for an entry, I realise, but I've just made packages of Google's OCRopus(tm) project for Ubuntu Hardy Heron. You can grab it here. There's no apt-get/aptitude repository, as I'm sure OCRopus will be in Hardy+1 and I don't want to clutter your /etc/apt/sources.list file with entries that will rapidly become stale. Also, it's unsigned. Sorry. The md5sum of the file is 9aee9459a6dc120a5a5537b49a67db0e if you want to verify it. It has a handful of dependencies that are all in the main distribution, so you should be able to sort those out in short order.
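If you'd rather not eyeball the checksum, here's a rough sketch of the check in Python. The only value taken from this post is the md5sum itself; the function names are made up for illustration, and you'll need to pass in whatever filename your download ended up with.

```python
import hashlib

# The checksum quoted in this post.
EXPECTED_MD5 = "9aee9459a6dc120a5a5537b49a67db0e"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected.lower()
```

Call matches_expected() with the path to the downloaded file; if it returns False, the download is corrupt or isn't the file I uploaded. Bear in mind that a matching md5sum only guards against a mangled download. It is no replacement for a signed package.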
So, why? Well, the broad problem I'm currently trying to solve is to eliminate the massive pile of paper that has been growing over the course of years in my house. See, there was a time when I just threw out any piece of paper that came my way, with no regard for whether I'd need it again in the future. It was a care-free and happy time, and one that I shall fondly remember for the rest of my days. Then, something terrible happened: I started earning enough money that I had to file tax returns. The first few of those were painful, mostly because I had none of the bits of paper that I needed in order to handle the onslaught of uninteresting questions posed by the ATO's various forms.
So these days, I don't throw any piece of paper out, because I'm not familiar enough with tax to confidently decide whether I'll need it in the future. That's not the whole story, though; it's not just the tax office that asks me to retrieve pieces of paper that I take no personal interest in. I've had cause to retrieve all manner of documents containing information that initially looked pretty ephemeral to me.
So it has come to the point where there's a whole shelf of my bookcase devoted to pieces of paper that I have no personal interest in, but that at one time in my life I thought might be valuable. Once any problem reaches the size of a bookshelf, it's a problem large enough that I think a computer would be a helpful tool to solve it.
My solution is one that I thought would be pretty standard: buy a sheet-feeding scanner, shove the documents into said scanner, scan them and throw away the originals. But of course, that's not enough. For this to actually solve the problem, I need to be able to locate any of these documents with very little effort. I want to be able to search through them the way I search my email: by full text search, tagging, date ranges etc.
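To make that concrete, here's a minimal sketch of the kind of search I have in mind, assuming the text of each scan has already been extracted somehow. It's a toy in-memory inverted index, not the system I actually use; the Document and DocumentIndex names and their fields are invented for illustration.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Document:
    doc_id: str
    text: str                      # text extracted from the scan
    scanned_on: date
    tags: set = field(default_factory=set)


class DocumentIndex:
    """Toy inverted index: maps each word to the ids of documents containing it."""

    def __init__(self):
        self.documents = {}
        self.postings = defaultdict(set)

    @staticmethod
    def tokenize(text):
        # Lower-case and split into runs of letters and digits.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, document):
        self.documents[document.doc_id] = document
        for word in self.tokenize(document.text):
            self.postings[word].add(document.doc_id)

    def search(self, query, tag=None, since=None, until=None):
        """Return sorted ids of documents containing every query word,
        optionally filtered by tag and by an inclusive date range."""
        words = self.tokenize(query)
        if words:
            hits = set.intersection(
                *(self.postings.get(word, set()) for word in words)
            )
        else:
            # An empty query matches everything; only the filters apply.
            hits = set(self.documents)
        results = []
        for doc_id in sorted(hits):
            doc = self.documents[doc_id]
            if tag is not None and tag not in doc.tags:
                continue
            if since is not None and doc.scanned_on < since:
                continue
            if until is not None and doc.scanned_on > until:
                continue
            results.append(doc_id)
        return results
```

You add() each scanned document once, then call search() with some words plus an optional tag, since or until date. A real setup would want the index stored on disk and some tolerance for OCR mistakes, neither of which this sketch attempts.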
Anyway, this last part requires me to find a way to extract the text from my scanned documents for indexing and searching. That means OCR. I'll spare you the details, but everything I tried was, well, appalling. Embarrassingly bad. Seriously. My test image was a pretty clean scan of a white page with plain black sans-serif text on it, and I had one app generate a page of punctuation as its output!
The bottom line is that OCRopus was far and away the best of the tools available, and since I'm anal about software installation I packaged it. I hope it's useful to you.
The other part of this story is that I also bought a scanner as part of this project. It's the first one I've ever owned. It turns out that these days you don't just buy scanners, you get a printer and fax machine and photocopier too! The device I bought was the HP OfficeJet 6310 All-in-One, and it's awesome. I won't go on too much about what's so great about it, except to say that if you run Linux, install hplip and you'll have it working in seconds. It has an Ethernet port on it, so you just plug it straight into your network and everyone can use it straight away.
But it gets better: you don't actually have to install any software! You can just punch in its IP address in a web-browser, and it has a web-GUI thing that will let you scan documents without using any client-side software at all! Neato! And for ~$150, you can't go wrong.
Hopefully I'll be back with something more substantial to say soon. Thanks for your patience throughout my hiatus.