I was asked to magicalize something that that will allow for categorizing companies in our database using SIC Codes. SIC is a deprecated standard. It is actually being phased out of use in favor of NAICS codes. But the powers that be specifically wanted SIC because apparently contains 30% more of awesome. Or something like that.
In fact, it was decided that we should use this list specifically. I’m pretty sure that all these codes exist somewhere out there as a nice comma separated list or another machine readable format. In fact I looked but I actually couldn’t find anything better. I figured out that the only reliable way to deal with it’s million entries is to dump the whole thing into a database table and then query it when needed. So I created a very simple table like so:
CREATE TABLE sic (
code varchar(8) NOT NULL,
description varchar(100) NOT NULL,
PRIMARY KEY (`code`)
) ENGINE=InnoDB
Then I copied the whole list from that page and pasted it into Vim for some quick processing. I figured that this would be a a good excuse to show off the power of regular expressions and the best editor in the universe. So I will show you how I got from that messy list to a neat, set of SQL statements.
First I needed to delete all the blank lines from the file. There are many ways to do it. I did it like this:
:g/^$/d
Next, I decided to delete all lines that did not start with a number. For some reason the list has some headings that shouldn’t really be there. I got rid of them with another quick regexp:
:g/\s*[a-zA-Z]/d
I also removed all the indentation:
:%s/\s*//g
Finally I converted each line into a SQL INSERT statement in a two step process. First padded the begging of each line with appropriate SQL syntax and enclosed the actual SIC code in quote marks:
:%s/\(^[0-9]*\)\s/INSERT INTO sic (code,description) VALUES ('\1','/g
Then I added the closing quote mark, paren and semicolon at the end of each line:
:%s/$/');/g
That’s it. It took me only few minutes to go from a fairly messy and machine-unfriendly list, to a clean SQL file. I figured that I might as well post it here for your convenience. If you ever find yourself in need of importing all the SIC codes and their descriptions into a database, just use my SQL file:
Your browser does not support iframes.
Or better yet, just right-click and save the SIC Codes SQL file from here.
If this seems like a throw away “let me show you the boring stuff that I’ve been doing lately” post it’s because it is. This week is totally killing me. I used some of my vacation days as padding in between Christmas and New Year’s so haven’t really been working since the 24th. So this week seems to be impossibly long. I’m serious, I feel like it has been 2 months since Sunday.
Really neat.
I’m not used to regexp, so when it comes to me I open my lisp REPL, and get some quick loop to do the job. Last time I did something like that was to pretty print a list of words for the pictionary game :D
Perl should be another great string-cruncher, but seriously from a lisper point of view, I never needed it.
Cheers
– Jaba
@Jaba: Heh, doing it the lisp way is possibly even more awesome. I did it in my text editor because it seemed intuitive to do it that way.
I was showing the converted SQL to someone and the were like “OMG, I can’t believe you actually did all of this”. I had to explain that I didn’t actually do it line by line, but with editor commands and that it took me 2 minutes. :P
Yeah, I agree. It was not a question on awesomeness :D just my 2 cents.
I never had the guts to start learning VI(M). I tried with emacs – but instead of investing the time on learning to do wonders on editors, I preferred the investment on Lisp. I really don’t use editors beyond… well, editing text :D I mean, directly.
My choices are notepad++ in windows and kate in linux. Great instruments to do what I need, of course nothing like the Two God’s Tools. But they’re sure more… friendly ^_^
ps “Lucasz”? Hell, we must be neighbors so! :D Ye olde Europe, isn’t it? ;)
Forgot to mention that Perl would probably use the same techniques as I did since it allows you to inline regexps.
[quote post=”2813″]ps “Lucasz”? Hell, we must be neighbors so! D Ye olde Europe, isn’t it?[/quote]
More accurately we used to be somewhat distant neighbors. I was born and lived in Poland until I was 17-18. Now I live in the US of A. The plugin that puts flags on comments tells me you are most likely located somewhere in or around Italy so there is a hueg ocean between us right now. :)
:D ahah, yeah, Italy is correct (nice plugin). Here is 00.18, so right now there are lots of water and hours between us. But I was talking about origins ;)
BTW, another two cents before going to bed – lots of sleep to recover.
In Lisp, to make it simple, I would do as follow:
1. read the input file, placing all elements in a list
2. use (remove-if) function to get rid of garbage
3. (with-open-file) destination path
4. (format) Seibel style :D
Just go here and scroll down for “Iteration”. With ~{ and ~} you can iterate your control through lists, so the format call would be something like:
or something like that – I’m not going into debugging right now ;)
The ~ symbol indicates a directive. ~{ and ~} consumes one list arguments, and iterates through elements. ~& ensures you are on a new line (your highlighting plugin gets an error on the &). ~A is for pretty printing “anything”, consumes one argument each (so iteration through the list goes two elements at a time). And ~% forces a new line.
There is plenty of similar directives – it really is a sub-language, you can see on Seibel’s page -, but pretty much all useful functions stay in just one page. When I need it, I just go to that page to get the proper syntax ^_^
One beautiful thing is that the loop is implicit, inside the format, so you don’t care.
Hope you’ll enjoy it ^_^
cheers
– Jaba
@Jaba: Wow, that’s actually very elegant. Nice! Your lisp-fu is truly better than mine! ;)
Thank you, but trust me: that is not even a nail of the complex elegance of which Lisp is capable, and I have barely started exploring it. It’s just that I’m doing some research on genetic programming on lisp, for my degree thesis, so I had to learn at least the basis.
If you like the lisp elegance I can write you a simple macro example to show you what you can do only in lisp.
Once upon a time, I needed to store reports about some variables, that would look like:
A 10
B 12
C -7
This can be stored in an object Report, or simply in a structure, but there was a problem: every time the number and type of the variables would change. So the sequent time the report would look like:
A 10
B 12
or even:
X “foo”
Y 27
W “bar”
So, I couldn’t specialize any data structure. I kept it to simple lists (they’re small reports, I don’t need any optimization), but I write a macro that, given the variables in input, would write the subsequent report:
Ok, I don’t know if that’s the best way to write it, but really they’re only reports on a really big project, and I was in a hurry. I’ll try to explain:
defmacro defines a macro. That is, it must be like a function that returns a list, actually of code, but instead of returning it, it EVALUATES it. That is, you have a program that writes program, that then get evaluated.
The &rest gets any number of arguments, and put them all in a list, here called args.
The ` operator, let’s say that builds a list from a template. For example
is the list (a b c), but elements preceded by comma (,) get evaluated, so
returns the list (a 4 c).
The comma-at operator splices a resulting list into elements, for example
evaluates to the list (a (1 2 3) c), but
evaluates to the list (a 1 2 3 c).
add-quotes is a function of mine that simply returns all elements of the input list preceded by a quote:
so a call of (add-quotes ‘(a b c)) would evaluate to the list (‘a ‘b ‘c) (so that all elements gets quoted)
pairs is another of mine that simply bound together in a list one element at a time from each of the N input lists:
Ok, now that we speak the same language, I can explain build-report (I paste it again here):
Let’s say you have two variables, A and B, that contain (refer to) the values 10 and 15. You need a fuckin’ report about those two guys. So, you just call:
Yeah, that’s all. Let’s see the expansion.
First step: args is the list (a b), so we enter two backquotes (“), then the inner command is a comma-at (,@). So let’s execute the command and split the list. We have to call pairs on two arguments. The first is a call to:
that returns the list (‘a ‘b). We can then call pairs:
That gets evaluated to the list ((‘a a) (‘b b)).
Ok, ok, ok, now, it’s a comma-at, so it gets spliced, and what we get is the list (‘a a ‘b b), ok? And with that we have consumed one back-quote and the comma-at (` and ,@).
Now, back to where the comma-at command came. This is resolved, so we can get the second (left-most) comma into action, with the second back-quote.
If there was a single element on the right, the comma would get applied to the single one, but since it’s spliced the argument, gets spliced the comma too, so we now have:
`(,’a ,a ,’b ,b)
that is like saying (list ‘a a ‘b b), nothing more nothing less. That’s the final list.
But here comes the miracle of macros: I said there are like functions that return list, only the list is actually of code, and once ready gets evaluated straight away.
So, we evaluate (list ‘a a ‘b b), and what’s the result?
(a 10 b 15)
So long and thanks for all the fish. You have your pretty report. And, as you can imagine, you could call build-report with as many variables as you want, as long as they’re declared in the ambient of the call.
I’ve built some list to extract stats from the reports, too, and to say the truth I should have used alists (that are, ((a . 10) (b . 15)), only implementation details) to have it simpler and faster and working with built-in functions. But I simply don’t give a fuck :D with lisp it’s often faster write your own instruments that look for the ones you need:
and I assure you, this gets the work pretty well done, and the critical time is not machine time from the ’70: now it’s the programmer time, so I’ll learn about alists when I’ll really need it ;P
Ok, enough: I really should start a blog of mine instead of spamming yours :P but I hope you’ll find that interesting.
So, from the deep of my lazy heart, thank you for this interesting-spammable blog :P
cheers
– Jaba
ps I kind of cheated on the real call on list with the backquotes, to keep them hidden until the end. I hope I’ll be forgiven, for the sake of legibility ;)
pps there’s again the problem with the lisp language plugin that can’t get the & character and displays instead & …
mmm let me re-write the end of the last pps:
with pre lang=”lisp” plugin, the & character gets represented as:
@Jaba: Ah, yes. The code highlighting plugin actually escapes all the HTML entities automatically. :)
And yeah, I’m somewhat familiar with macros. The beautiful thing about lisp is that it’s code is expressed in therms of it’s data structures. So data can be code, and code can be data, which introduces a mind boggling level of flexibility into the language.
Thanks for the examples. I will have to go over them slowly again later. :)
Hey… I have been searching for just this kind of file! I was hoping at best for TXT or CSV but to find the SQL is ideal so thank you very much!
@Alex: Btw, a recent ticket in my bug tracking system tells me that list is incomplete. It’s missing entries between 23xx and 35xx codes.
I shall update it here once I get to it at work. Must clear all the stuff that is marked critical and omg-the-sky-is-falling first.
Hey Luke,
That’s strange, I just checked my DB and I have all the values, including between 23xx and 35xx. I did have to sanitize the SQL file a bit; you had semicolons in the `description` values which were throwing things off. I have an updated SQL file with all the quotes sanitized which ran without any errors; if you want I’ll email it to you, just let me know and leave me your email address (mine is alex@imperialtextile.com.
@Alex: Ah, that what it was then – some of the statements failed, but I didn’t notice and/or care when I imported it.
Then I read your post literally like 5 minutes after reading the bug report about missing values on the test system and I was like “aw, crap!”. :)
It shall be easy to fix now. Thanks!
Glad I could help! :) Thanks again, this made a report I was writing much easier!
Using your ready-made SIC code SQL. Thanks — this will save me a bit of time!
Pretty neat solution!
Thanks! :)
Nice but the SIC data is hierarchical and the schema you defined makes it hard to find parent and child entries.
This should be more sth like:
CREATE TABLE `sic_code` (
`code` varchar(32) NOT NULL DEFAULT ”,
`label` varchar(256) DEFAULT NULL,
`parent_code` varchar(32) DEFAULT NULL,
`level` int(11) DEFAULT NULL
) DEFAULT CHARSET=utf8;
INSERT INTO `sic_code` (`code`,`label`,`parent_code`,`level`)
VALUES
(‘D’,’MANUFACTURING’,NULL,0),
(‘I’,’SERVICES’,NULL,0),
(‘H’,’FINANCE, INSURANCE, AND REAL ESTATE’,NULL,0),
(‘F’,’WHOLESALE TRADE’,NULL,0),
(‘E’,’TRANSPORTATION, COMMUNICATIONS, ELECTRIC, GAS, AND SANITARY SERVICES’,NULL,0),
(‘J’,’PUBLIC ADMINISTRATION’,NULL,0),
(‘3199′,’Leather goods, not elsewhere classified’,’319′,3),
(‘478′,’Miscellaneous services incidental to transportation’,’47’,2),
(‘473′,’Arrangement of transportation of freight and cargo’,’47’,2),
(‘474′,’Rental of railroad cars’,’47’,2),
(‘8399′,’Social services, not elsewhere classified’,’8