Archive for February, 2007

Data Parsing and Conversion

Wednesday, February 28th, 2007

Some time ago someone at work asked me if I know anything about data mining. I replied that I sure do. Hell, I almost wrote a thesis on automated blind database integration. It was actually quite fascinating - we used a modified TF-IDF scoring algorithm to discover which columns in which tables are semantically similar. Unfortunately my mentor got a really sweet job at Oracle and left academia. That’s how I ended up doing hyperspectral imaging.

I figured that nothing the business world could throw at me would be as complex as the problem I dealt with - i.e. “here are two databases, we know nothing about their schemas - now merge them”.

Little did I know that in the world of asset-based lending and accounting, “data mining” roughly translates to “converting various files into Excel format”. And of course you don’t even get to touch the database - oh no! You get a long flat file regurgitated by the client’s reporting system, usually in PDF format. Of course most of this data is designed for printer output - meaning the reporting system pages the data into sizable chunks, and fits them with headers, footers, and all kinds of additional human-friendly bullshit.

If you get lucky and get a text file, you can hammer it with awk or perl and get it formatted into a nice CSV file that can then be imported into Excel. If you get a PDF, you are kinda fucked.

Fortunately people have devised software for this kind of work. I’m currently using Monarch Pro, which is essentially an adjustable sieve for data files. You get to specify the shape of your data (it can do some primitive pattern matching, but not regexps, cause that would actually make sense), then shake it, and an Excel file should fall out. Unfortunately, just like with a real sieve, you can only adjust it in a limited way. For example, you can only have one distinct class of collected data per file, and that class must be represented as a single line/row. Let me show you:

Viewing a Report in Monarch

In the image above you can see a PDF report opened in Monarch. I blurred company names and other info for obvious reasons (not that you would get much out of this anyway). As you can see, the file has headings, information lines, subtotals, etc. all over the place. But the actual data can probably be best represented by the highlighted line. It can be used to create a “template”:

Creating A Detail Template

As you can see, we are creating a “detail” template. You can only have one detail template, which is ridiculous IMHO, but oh well. The first thing we do is tell Monarch how to find out which lines are significant. In this case it is simple. Each record line that contains actual “meat” starts with an account number. So I simply make Monarch grab each line which has a 6-digit numeric string in that specific position. You can do this sort of pattern matching using the little buttons. The accented N means any numerics, the A means any literals, beta means blank spaces, and empty set means non-blank characters. Of course you can’t do regexps because that would actually be useful, and we can’t have that in software designed for a business sector.

Now you define fields by highlighting them in the yellow box. Once you’re done, your detail template should look something like this:

Defining Fields

Of course there is some data around the “detail” that we might want to capture. Pay attention now, because this is the only thing in Monarch that I didn’t intuitively get on the first try. You can create what they call “append” templates. What they do is incorporate the data found on a given line into the “detail”. The fields in the append template will become columns in your finished table, and will be repeated down as needed. I will show you exactly how this works. First let’s make an “append” template. It’s done the exact same way - you highlight a line, say “New Template” and then provide some pattern matching hook and define fields:

Creating an Append Template

Here my hook is the word Limit, repeating on each subheading in the exact same position, and my fields are the company name, the limit value, last payment, etc. Once we have all the templates set, we can switch to the table view and see the fruits of our efforts:

Table View

As you can see, the append fields became columns and their values get repeated downwards for each “detail” row associated with that append block. We can now export it into a range of different data formats, out of which Excel is probably the most desirable for the business people.
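For comparison, the same detail/append idea can be sketched in a few lines of Python for the lucky case where you get a plain text file. This is only a rough illustration, not how Monarch works internally: the report layout, the `Customer:` / `Limit:` labels, the column names and both regular expressions are made up for the example. Only the 6-digit account number hook and the word Limit as the append hook come from the report described above.

```python
import csv
import io
import re

# Hypothetical report layout, invented purely for illustration.
SAMPLE_REPORT = """\
ACME LENDING REPORT                         Page 1
Customer: Widget Co          Limit:  50000
100234  2007-01-15  INV-001     1200.50
100235  2007-01-20  INV-002      300.00
        Subtotal                1500.50
Customer: Gadget Inc         Limit:  75000
200111  2007-02-01  INV-104      999.99
"""

# "Detail" hook: the line starts with a 6-digit account number.
DETAIL = re.compile(r"^(\d{6})\s+(\S+)\s+(\S+)\s+([\d.]+)\s*$")
# "Append" hook: the subheading line carries the word Limit.
APPEND = re.compile(r"^Customer:\s+(.+?)\s+Limit:\s+(\d+)\s*$")

FIELDS = ["customer", "limit", "account", "date", "invoice", "amount"]


def extract_rows(report_text):
    """Return one dict per detail line, with append fields repeated down."""
    rows = []
    current = None  # fields from the most recent append line
    for line in report_text.splitlines():
        match = APPEND.match(line)
        if match:
            current = {"customer": match.group(1), "limit": match.group(2)}
            continue
        match = DETAIL.match(line)
        if match and current is not None:
            account, date, invoice, amount = match.groups()
            rows.append({**current, "account": account, "date": date,
                         "invoice": invoice, "amount": amount})
        # Headers, footers and subtotal lines match neither hook
        # and are silently skipped.
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text that Excel can import."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    print(to_csv(extract_rows(SAMPLE_REPORT)))
```

Note that the subtotal line simply falls through both patterns here, which mirrors the single-detail-template limitation rather than solving it.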

There is one thing I haven’t figured out with Monarch yet: how to include subtotals in the table. Underneath each block of data in my file, there is a company subtotal. It does not make sense to capture this info as an “append” because I don’t want to see it repeated.

It also doesn’t make sense to include it in detail, because it does not conform to the detail format, and will throw off my pattern matching. If you could have more than one detail template I would probably be set, but you can’t.

So I guess there are two choices here - either leave the totals out of the file completely (after all, they can be automatically generated by Excel anyway) or tweak your detail template until it includes the totals.
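The first choice is cheap because subtotals are derived data. As a rough sketch (the company names and amounts below are invented, and this is Python rather than anything Monarch or Excel does for you), regenerating them from the exported detail rows looks like this:

```python
from collections import defaultdict
from decimal import Decimal


def subtotals_by_company(rows):
    """Sum the amounts of (company, amount) pairs per company.

    Amounts are passed as strings and summed as Decimal so that
    currency values do not pick up floating point noise.
    """
    totals = defaultdict(Decimal)  # Decimal() starts every total at 0
    for company, amount in rows:
        totals[company] += Decimal(amount)
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical detail rows, as they might come out of the export.
    detail_rows = [
        ("Widget Co", "1200.50"),
        ("Widget Co", "300.00"),
        ("Gadget Inc", "999.99"),
    ]
    for company, total in subtotals_by_company(detail_rows).items():
        print(company, total)
```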

I guess the moral of this story is that if a non-technical person asks you if you have any experience in data mining, it usually means they need someone to convert their PDF files into Excel - and not to extract and merge data from relational databases.

File Transfer Problem

Wednesday, February 28th, 2007

It’s surprising that even though I run into this problem quite often, I have yet to find a perfect solution. I’m talking about file transfer. Say we have two users - let’s call them Bob and Alice. These two are inexplicably dumb, and have no clue about technology. They are in two different states, in two different time zones and they can’t exchange documents in person. Bob has some files that Alice will need in an hour or so, so snail mail is out of the question. How do the two go about exchanging the files if we assume both have a broadband internet connection at their location?

Simplest solution of course is to email the files. This is what the tech idiots do all the time after all. And it works, unless of course Bob’s file is roughly 1GB. In such a situation email becomes useless, since most servers will refuse to handle attachments of that size.

I had this situation happen to me today, and I was at a loss. I kept running different scenarios in my head, and I could not find a solution. I need a simple, no hassle solution that would allow transferring of the file with no complicated setup involved.

The first thing I thought of was IM. Most modern IM protocols support file transfer, which could potentially allow for transferring our large file. Unfortunately, Bob uses AIM, while Alice prefers MSN, and so they can’t talk to each other. Registering a new account is tedious and annoying, and neither one wants to do it. So it’s out of the question.

Bob could set up an FTP server… But unfortunately he is an idiot. Furthermore, both of them also live behind NATs and firewalls, so any kind of server-client communication will be very difficult. This includes setting up a torrent tracker and seeding files.

Using 3rd party services such as SendIt is out of the question because neither Bob nor Alice wants the file sitting on some random server for an unknown amount of time. They are bound by confidentiality agreements, etc…

So how do we get them to exchange files?

The best I could do for them was to set up an FTP server in the office and have Bob upload the file to it, only to be downloaded by Alice. Not a perfect solution, but a workable one.

Unfortunately, before I was able to configure IIS, punch a hole in the firewall and send them instructions, they had already gotten annoyed and given up. They were strapped for time. So in the end they opted for faxing each other the relevant pages and snail mailing the rest.

But the problem is still here. If we remove the presence of a central office (here, me) then Bob and Alice would still be without means to exchange files… How do you usually do this?

You know you are working at a small company if…

Tuesday, February 27th, 2007

You know that you are working at a small company if your employer’s official website has a lower Google page rank than your personal blog. The various online tools tell me that my page rank is 4. My employer’s rank is 2. Go figure. Note that neither one of us engaged in any SEO activities.

I stumbled upon this realization when I tried to google the page rank of our website at work for shits and giggles. My boss just invested some money into some online advertising and he wanted to see if we get more traffic. So I dug out the server logs that dated all the way back to 2004 and parsed through them with Analog to make a nice-looking report with bar and pie charts. It turns out that it worked - Jan and Feb showed a huge spike in traffic, which means the advertising is working. I’m not sure how much more business we are getting out of this, but we do get more hits than usual. After going through that data, I figured I might as well check out page rank… And compare it to my own.

And it seems that Google considers me to be much more interesting.

Rails: #28000Access denied

Tuesday, February 27th, 2007

I’m continuing to mess around with Ruby on Rails on WinXP. I just ran into another problem that took me over an hour to figure out. I set up a database in MySQL, configured my database.yml file, created a model and a controller and launched WEBrick. I put the scaffold line in the controller to generate an auto-magical interface and I kept getting this error message:

#28000Access denied for user 'root'@'localhost' (using password: NO)
RAILS_ROOT: ./script/../config/..

This was driving me crazy, because my yml file specified to use the “webuser” account instead of root. And yet, Rails insisted on fucking around with the root account. I checked it about a hundred times, changed access privileges on my table every possible way, and scoured the web for similar problems. It seems that quite a few people are running into the same damn issue, and almost no one knows how to fix it.

Finally after reading through the comments here, I found the solution. Simply run the server in production mode:

ruby script/server -p80 --environment=production

This worked for me just fine. I’m not sure why this was happening. Can any Ruby experts out there explain this behavior to me? Oh well, for now I guess that as long as I stay in production, I can avoid that stupid error.

Btw, Rails is pretty amazing when it works. It only takes one line of code, and you get a simple, yet surprisingly complete interface that lets you populate, edit and remove stuff from your database table:

Rails Scaffold

How awesome is that?

Strange WEBrick Error

Monday, February 26th, 2007

I was messing around with ruby on my Windows box and for some reason WEBrick kept crashing on me with the following output:

ruby script/server
=> Booting WEBrick...
=> Rails application started on http://0.0.0.0:3000
=> Ctrl-C to shutdown server; call with --help for options
[2007-02-26 03:14:11] INFO WEBrick 1.3.1
[2007-02-26 03:14:11] INFO ruby 1.8.5 (2006-08-25) [i386-mswin32]
[2007-02-26 03:14:11] WARN TCPServer Error: Bad file descriptor - bind(2)
c:/ruby/lib/ruby/1.8/webrick/utils.rb:73:in `initialize': Bad file descriptor - bind(2) (Errno::EBADF)
from c:/ruby/lib/ruby/1.8/webrick/utils.rb:73:in `new'
from c:/ruby/lib/ruby/1.8/webrick/utils.rb:73:in `create_listeners'
from c:/ruby/lib/ruby/1.8/webrick/utils.rb:70:in `each'
from c:/ruby/lib/ruby/1.8/webrick/utils.rb:70:in `create_listeners'
from c:/ruby/lib/ruby/1.8/webrick/server.rb:75:in `listen'
from c:/ruby/lib/ruby/1.8/webrick/server.rb:63:in `initialize'
from c:/ruby/lib/ruby/1.8/webrick/httpserver.rb:24:in `initialize'
from c:/ruby/lib/ruby/gems/1.8/gems/rails-1.2.2/lib/webrick_server.rb:58:in `new'
... 7 levels...
from c:/ruby/lib/ruby/gems/1.8/gems/rails-1.2.2/lib/commands/server.rb:39
from c:/ruby/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:27:in `gem_original_require'
from c:/ruby/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:27:in `require'
from script/server:3

A quick google indicates that I’m not the only one having this issue but it does seem very rare. Curiously enough the solution turned out to be using a different port for the server. For example I did:

ruby script/server -p8080

That seemed to work just fine. Strange… I’m wondering if this might have something to do with my AV or Firewall blocking connections on port 3000 for some reason…

Update 02/26/2007 12:48:56 PM

It seems that Rich found an explanation for this behavior. The workaround is really simple:

Take that fucking PocketPC out of the fucking cradle and run WEBrick again.

It appears that wcescomm.exe hogs port 3000 when Microsoft ActiveSync is running.
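If you want to check whether something is already sitting on a port before blaming the web server, a quick probe is to try binding to it yourself. Here is a minimal sketch in Python (not part of Rails or WEBrick; the function name and the loopback address are my own choices for the example):

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Some other process already holds the port
            # (or we are not allowed to bind to it).
            return False
        return True


if __name__ == "__main__":
    for candidate in (3000, 8080):
        state = "free" if port_is_free(candidate) else "taken"
        print("port", candidate, "is", state)
```

This only tells you that the port is taken, not by whom. To find the culprit on Windows, `netstat -ano` lists listening ports together with the owning process ID.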

How to build a Crawler

Sunday, February 25th, 2007

This made me chuckle:

#732536 +(361)- [X]

<El_Pompo> what would be the best language to build a crawler in?

<Emetri> jawa.

It’s a quote from bash.org of course. It might take you a few seconds, but you will get it. Or not. Either way, I’m not explaining it.

Ok, here is a hint:

Hint

Get it?

I think I want a Koala!

Sunday, February 25th, 2007

I think I want a Koala! No, not the bear, you silly people. What the hell would I do with a Koala bear? Although, I have to admit it would be a kinda awesome conversation piece. Chicks dig little cute fuzzy critters like that too, so it possibly could get me laid or something.

“Oh, that? That’s just my Koala bear. He eats eucalyptus leaves and poops golden nuggets. Yup, that’s what they do. So…. You wanna do it?”

Yup, it would be quite awesome. Quite awesome indeed. But I digress… What I wanted to talk about is this:

Koala White
Koala Black

Koala is a mini desktop system from System76 that imitates the Mac Mini in shape and size. It has decent hardware specs for its size, comes preloaded with Ubuntu (Edgy) and sells for $600. I’m tempted to buy it because of the size. Right now, I’m strapped for space and I have nowhere to put another computer. But I could just plop Koala anywhere, and since it is a Linux machine it could basically run headless. It also has wifi, so all I really need is to set it up once, then just plug it into a power socket and I have a running Linux server that takes virtually no space.

I would essentially use it as an SSH server which would be the entry point to my network from the outside. I could probably also run several other services on it for my personal use.

Yes, I could buy a Mac Mini for the same price, but I already decided that I will most likely buy a MacBook at some point this year and I kinda really want a Linux machine running in the house. And I totally want to support a business which sells Linux based systems - there are so few of them these days.

Does anyone here have any experience with System76? How is their hardware? Are they reliable? Do they offer decent support and warranty? Let me know.

Site Outage Yesterday

Sunday, February 25th, 2007

The site was down between 3 am and 7am EST today. I apologize for the inconvenience, but this was beyond my control. My host was doing maintenance in their data center, and had to shut the power down in their building.

Anyways, sorry for the interruptions. All systems should be up and running by now.

Names: Choose Them Wisely

Sunday, February 25th, 2007

Alex Papadimoulis from The Daily WTF just decided to rename his website to Worse than Failure. If you are not familiar with his site, you should definitely check it out. He posts daily WTF stories from the IT industry sent in by readers, which are usually good for a chuckle or two.

It’s funny, but I always figured that the name suited the content perfectly. However, it seems that Alex has grown tired of trying to explain his website’s title to people in the real world. He specifically mentions the awkward situation in which he was asked by his grandmother to define the acronym WTF. Personally, I do not consider this term to be vulgar, but someone who never heard it before might possibly be offended by it.

An awesome-cool name is not always a good name in the long run, even if it describes your content well. A popular website with a large reader base can be a great asset. But if your domain name contains the term WTF it might not be the best idea to put it on your resume.

A poorly chosen name can also be a factor stunting the growth of your readership. For one, I know that my own site name is a mouthful. In addition, my domain name is long, and has a hyphen in the middle of it, making it difficult to remember. In retrospect, I have to admit that it was probably a poor decision. So at some point in the future I may need to rethink this name, and perhaps change it into something shorter, and easier to remember.

Fortunately the name of this website would probably never put me in an awkward or embarrassing position like the one Alex described. Even though it’s hard to spell and pronounce, it is clean. What would get me in trouble though is the content, since I tend to swear, insult people and voice my political opinions here.

Choosing good names that are catchy, memorable and fun is hard. This is why people in marketing can actually be useful. Normally upon sighting a marketroid I would “kill it with fire”, but I have to admit that they can actually be useful sometimes if controlled properly.

Verizon: .002 Cents

Saturday, February 24th, 2007

Lately everyone complains about the call centers in India. Sadly the truth is that when you call a large company, the first line of support operators will be equally dumb no matter where the call center is located. Because of the volume of calls that you get at large corporations, it does not make sense to put skilled technicians in a call center. What you do is you hire the dumbest, least qualified phone drones by the truckload and have them filter the calls. Important issues get escalated, and usually reach someone with some expertise. But before they do, you will probably go through 3-4 phone calls, and about as many escalations. It sucks ass but that’s how they do it.

The sad truth is that people manning the support lines stateside can be so dumb that it hurts. For example, it appears that no one at the Verizon call center can tell the difference between .002 cents and .002 dollars. I shit you not. Listen to this audio clip of a recorded support call (note, it’s really long):

The story here is this: the caller calls Verizon and asks them how much they will charge him for data transfer in Canada. They quote him .002 cents per KB. He actually makes them write it down in the notes. Later he gets a bill which charges him .002 dollars per KB. He calls back to contest this charge, and neither the support drone, nor his manager, nor even the floor manager can understand what he is talking about.
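For anyone who wants to see the difference spelled out, here is the arithmetic as a small Python sketch. The two rates are the ones from the story, but the 1000 KB usage figure is a number I made up for the example, not the caller's actual usage:

```python
from decimal import Decimal

CENTS_PER_DOLLAR = 100


def cost_in_dollars(kilobytes, rate, unit):
    """Cost of a transfer in dollars, given a per-KB rate in cents or dollars."""
    rate = Decimal(rate)
    if unit == "cents":
        rate = rate / CENTS_PER_DOLLAR  # convert cents to dollars
    elif unit != "dollars":
        raise ValueError("unit must be 'cents' or 'dollars'")
    return rate * kilobytes


if __name__ == "__main__":
    usage_kb = 1000  # hypothetical usage, chosen for round numbers
    quoted = cost_in_dollars(usage_kb, "0.002", "cents")
    billed = cost_in_dollars(usage_kb, "0.002", "dollars")
    print("quoted: $", quoted)
    print("billed: $", billed)
    print("billed is", billed / quoted, "times the quoted price")
```

Whatever the usage, the bill comes out exactly 100 times higher than the quote, because a cent is one hundredth of a dollar.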

It actually hurts my brain to listen to this conversation. If I didn’t hear it, I would not believe that this exchange actually took place. I realize people are dumb, but I didn’t suspect that you could be this dumb and still function in the society.

Fortunately it seems that the guy who created this recording finally got high enough in the Verizon command chain that he actually encountered someone with a brain, who gave him a full refund.
Unfortunately their rank and file continue to quote people the wrong price.

