Archive for February, 2007

Data Parsing and Conversion

Wednesday, February 28th, 2007

Some time ago someone at work asked me if I knew anything about data mining. I replied that I sure did. Hell, I almost wrote a thesis on automated blind database integration. It was actually quite fascinating - we used a modified TF-IDF scoring algorithm to discover which columns in which tables were semantically similar. Unfortunately my mentor got a really sweet job at Oracle and left academia. That’s how I ended up doing hyperspectral imaging.

I figured that nothing the business world could throw at me would be as complex as the problem I dealt with - i.e. “here are two databases, we know nothing about their schemas - now merge them”.

Little did I know that in the world of asset-based lending and accounting, “data mining” roughly translates to “converting various files into Excel format”. And of course you don’t even get to touch the database - oh no! You get a long flat file regurgitated by the client’s reporting system, usually in PDF format. Of course most of this output is designed for the printer - meaning the system pages the data into sizable chunks and fits them with headers, footers, and all kinds of additional human-friendly bullshit.

If you get lucky and get a text file, you can hammer it with awk or Perl and get it formatted into a nice CSV file that can then be imported into Excel. If you get a PDF, you are kinda fucked.

Fortunately people have devised software for this kind of work. I’m currently using Monarch Pro, which is essentially an adjustable sieve for data files. You get to specify the shape of your data (it can do some primitive pattern matching, but not regexps, ’cause that would actually make sense), then shake it and an Excel file should fall out. Unfortunately, just like with a real sieve, you can only adjust it in a limited way. For example, you can only have one distinct class of collected data per file, and that class must be represented as a single line/row. Let me show you:

[Image: Viewing a Report in Monarch]

In the image above you can see a PDF report opened in Monarch. I blurred company names and other info for obvious reasons (not that you would get much out of this anyway). As you can see, the file has headings, information lines, subtotals, etc. all over the place. But the actual data is probably best represented by the highlighted line. It can be used to create a “template”:

[Image: Creating A Detail Template]

As you can see, we are creating a “detail” template. You can only have one detail template, which is ridiculous IMHO, but oh well. The first thing we do is tell Monarch how to find out which lines are significant. In this case it is simple. Each record line that contains actual “meat” starts with an account number. So I simply make Monarch grab each line which has a 6-digit numeric string in that specific position. You can do this sort of pattern matching using the little buttons. The accented N means any numeric character, the A means any letter, beta means a blank space, and the empty set means any non-blank character. Of course you can’t do regexps because that would actually be useful, and we can’t have that in software designed for a business sector.
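
For comparison, here is roughly what that kind of line trap looks like when you do get regexps. This is only a Python sketch, not anything Monarch provides: the sample lines, the column position of the account number, and the single-letter stand-ins for the trap symbols (N, A, B for beta, O for the empty set) are all made up for illustration.

```python
import re

# Rough regex equivalents of the four trap symbols described above.
# The single-letter keys are stand-ins chosen for this sketch.
TRAP_TO_REGEX = {
    "N": r"\d",        # accented N: any numeric character
    "A": r"[A-Za-z]",  # A: any letter
    "B": " ",          # beta: a blank space
    "O": r"\S",        # empty set: any non-blank character
}


def trap_to_regex(trap):
    """Turn a trap such as '  NNNNNN' into a regex anchored at column 0.

    Any character that is not a trap symbol (here: a space) means
    "don't care" and matches any single character in that column.
    """
    return re.compile("^" + "".join(TRAP_TO_REGEX.get(ch, ".") for ch in trap))


# Hypothetical report lines; the layout is an assumption, not the real file.
SAMPLE_LINES = [
    "ACME REPORTING SYSTEM            PAGE 1",
    "  100234  INV-0001   01/15/07     1,250.00",
    "  Subtotal                        1,250.00",
    "  100987  INV-0002   01/22/07       310.75",
]

# Six digits starting in the third column mark a detail line.
detail_trap = trap_to_regex("  NNNNNN")
detail_lines = [line for line in SAMPLE_LINES if detail_trap.match(line)]
```

With the sample above, only the two lines carrying a 6-digit account number survive; the page header and the subtotal line are filtered out.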

Now you define fields by highlighting them in the yellow box. Once you’re done, your detail template should look something like this:

[Image: Defining Fields]

Of course there is some data around the “detail” that we might want to capture. Pay attention now, because this is the only thing in Monarch that I didn’t intuitively get on the first try. You can create what they call “append” templates. What they do is incorporate the data found on a given line into the “detail”. The fields in the append template will become columns in your finished table, and will be repeated down as needed. I will show you exactly how this works. First let’s make an “append” template. It’s done the exact same way - you highlight a line, say “New Template” and then provide some pattern matching hook and define fields:

[Image: Creating an Append Template]

Here my hook is the word Limit, which repeats on each subheading in the exact same position, and my fields are the company name, the limit value, last payment, etc. Once we have all the templates set, we can switch to the table view and see the fruits of our efforts:

[Image: Table View]

As you can see, the append fields became columns and their values get repeated downwards for each “detail” row associated with that append block. We can now export the table into a range of different data formats, out of which Excel is probably the most desirable for the business people.
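
The detail/append idea is easy to mimic in a script if the report arrives as plain text. The following is a minimal Python sketch under stated assumptions: the report layout, the `Customer:` prefix and the field names are hypothetical, and this is my approximation of the behavior, not how Monarch works internally. Each subheading line updates the current append fields, and every detail line below it gets a copy of them.

```python
import csv
import io
import re

# Hypothetical report text; the layout and field names are assumptions.
REPORT = """\
Customer: ACME CORP          Limit:   50,000.00
  100234  INV-0001   01/15/07     1,250.00
  100235  INV-0002   01/22/07       310.75
  Subtotal                        1,560.75
Customer: GLOBEX INC         Limit:   20,000.00
  200001  INV-0107   02/01/07       999.99
"""

# "Append" hook: the word Limit on a subheading line.
APPEND_RE = re.compile(
    r"^Customer:\s+(?P<company>.+?)\s+Limit:\s+(?P<limit>[\d,.]+)\s*$"
)
# "Detail" hook: a 6-digit account number, then invoice, date and amount.
DETAIL_RE = re.compile(
    r"^\s+(?P<account>\d{6})\s+(?P<invoice>\S+)\s+(?P<date>\S+)"
    r"\s+(?P<amount>[\d,.]+)\s*$"
)


def report_to_rows(text):
    """Return one dict per detail line, with the latest append fields copied in."""
    rows = []
    current_append = {}
    for line in text.splitlines():
        append_match = APPEND_RE.match(line)
        if append_match:
            # A new subheading: its fields apply to the detail lines below it.
            current_append = append_match.groupdict()
            continue
        detail_match = DETAIL_RE.match(line)
        if detail_match:
            rows.append({**current_append, **detail_match.groupdict()})
        # Anything else (page headers, subtotals) is silently skipped.
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text that Excel can import."""
    fields = ["company", "limit", "account", "invoice", "date", "amount"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = report_to_rows(REPORT)
csv_text = rows_to_csv(rows)
```

Here `rows` holds three detail rows, and the company name and limit are repeated on each row of their block, just like the append columns in the table view.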

There is one thing I haven’t figured out with Monarch yet: how to include subtotals in the table. Underneath each block of data in my file, there is a company subtotal. It does not make sense to capture this info as an “append” because I don’t want to see it repeated.

It also doesn’t make sense to include it in detail, because it does not conform to the detail format, and will throw off my pattern matching. If you could have more than one detail template I would probably be set, but you can’t.

So I guess there are two choices here - either leave the totals out of the file completely (after all, they can be automatically generated by Excel anyway) or tweak your detail template until it includes the totals.
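
The first choice is cheap because subtotals are just sums over the detail rows. As a rough illustration (a Python sketch with made-up rows and column names, standing in for whatever Excel would do with the exported table), they can be regenerated like this:

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical detail rows, shaped like an exported table.
ROWS = [
    {"company": "ACME CORP", "account": "100234", "amount": "1,250.00"},
    {"company": "ACME CORP", "account": "100235", "amount": "310.75"},
    {"company": "GLOBEX INC", "account": "200001", "amount": "999.99"},
]


def subtotals_by_company(rows):
    """Sum the amount column per company, using Decimal to avoid float drift."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in rows:
        # Strip thousands separators before converting the printed amount.
        totals[row["company"]] += Decimal(row["amount"].replace(",", ""))
    return dict(totals)


subtotals = subtotals_by_company(ROWS)
```

Comparing these recomputed figures against the subtotals printed in the report is also a handy sanity check that no detail lines were dropped.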

I guess the moral of this story is that if a non-technical person asks you if you have any experience in data mining, it usually means they need someone to convert their PDF files into Excel - and not to extract and merge data from relational databases. :|

File Transfer Problem

Wednesday, February 28th, 2007

It’s surprising that even though I run into this problem quite often, I have yet to find a perfect solution. I’m talking about file transfer. Say we have two users - let’s call them Bob and Alice. These two are inexplicably dumb, and have no clue about technology. They are in two different states, in two different time zones, and they can’t exchange documents in person. Bob has some files that Alice will need in an hour or so, so snail mail is out of the question. How do the two go about exchanging the files if we assume both have a broadband internet connection at their location?

Simplest solution of course is to email the files. This is what the tech idiots do all the time after all. And it works, unless of course Bob’s file is roughly 1GB. In such a situation email becomes useless, since most servers will refuse to handle attachments of that size.

I had this situation happen to me today, and I was at a loss. I kept running different scenarios in my head, and I could not find a solution. I needed a simple, no-hassle way to transfer the file with no complicated setup involved.

First thing I thought of was IM. Most modern IM protocols support file transfers, which could potentially handle our large file. Unfortunately, Bob uses AIM while Alice prefers MSN, so they can’t talk to each other. Registering a new account is tedious and annoying, and neither one wants to do it. So it’s out of the question.

Bob could set up an FTP server… But unfortunately he is an idiot. Furthermore, both of them live behind NATs and firewalls, so any kind of server-client communication will be very difficult. This includes setting up a torrent tracker and seeding files.

Using 3rd party services such as SendIt is out of the question because neither Bob nor Alice wants the file sitting on some random server for an unknown amount of time. They are bound by confidentiality agreements, etc.

So how do we get them to exchange files?

The best I could do for them was to set up an FTP server in the office and have Bob upload the file to it, only to be downloaded by Alice. Not a perfect solution, but a workable one.

Unfortunately, before I was able to configure IIS, punch a hole in the firewall and send them instructions, they had already gotten annoyed and given up. They were strapped for time. So in the end they opted for faxing each other the relevant pages and snail mailing the rest.

But the problem is still here. If we remove the presence of a central office (here, me) then Bob and Alice would still be without means to exchange files… How do you usually do this?

You know you are working at a small company if…

Tuesday, February 27th, 2007

You know that you are working at a small company if your employer’s official website has a lower Google page rank than your personal blog. The various online tools tell me that my page rank is 4. My employer’s rank is 2. Go figure. Note that neither one of us engaged in any SEO activities.

I stumbled upon this realization when I tried to google the page rank of our website at work for shits and giggles. My boss just invested some money into some online advertising and he wanted to see if we were getting more traffic. So I dug out the server logs that dated all the way back to 2004 and parsed through them with Analog to make a nice-looking report with bar and pie charts. It turns out that it worked - Jan and Feb showed a huge spike in traffic, which means the advertising is working. I’m not sure how much more business we are getting out of this, but we do get more hits than usual. After going through that data, I figured I might as well check our page rank… And compare it to my own.

And it seems that Google considers me to be much more interesting. :)

Rails: #28000Access denied

Tuesday, February 27th, 2007

I’m continuing to mess around with Ruby on Rails on WinXP. I just ran into another problem that took me over an hour to figure out. I set up a database in MySQL, configured my database.yml file, created a model and a controller and launched WEBrick. I put the scaffold line in the controller to generate an auto-magical interface and I kept getting this error message:

#28000Access denied for user 'root'@'localhost' (using password: NO)
RAILS_ROOT: ./script/../config/..

This was driving me crazy, because my yml file specified the “webuser” account instead of root. And yet, Rails insisted on fucking around with the root account. I checked it about a hundred times, changed access privileges on my table every possible way, and scoured the web for similar problems. It seems that quite a few people are running into the same damn issue, and almost no one knows how to fix it.

Finally after reading through the comments here, I found the solution. Simply run the server in production mode:

ruby script/server -p80 --environment=production

This worked for me just fine. I’m not sure why this was happening. Can any Ruby experts out there explain this behavior to me? Oh well, for now I guess that as long as I stay in production, I can avoid that stupid error.

Btw, Rails is pretty amazing when it works. It only takes one line of code, and you get a simple, yet surprisingly complete interface that lets you populate, edit and remove stuff from your database table:

[Image: Rails Scaffold]

How awesome is that?