Lisp: Parse and Aggregate a CSV File

Posted on December 22, 2007 by Luke Maciak tein.co/2160

Yesterday I wrote about that little problem aggregating data in a big csv file. Just for shits and giggles, here is the lisp code to parse and aggregate a file like that. It assumes that we group the entries based on the first column, and aggregate the last. The file would look something like this:

something, stuff, poop, 20 something else, other stuff, pooper, 12 other stuff, stuff, poop, 55 something, abc, xx, 5

The script takes the name of the csv file to parse from the command line arguments. Here is the code:

;;; get split-sequence from http://www.cliki.net/SPLIT-SEQUENCE
(load "split-sequence.lisp")
(setq aggregate '()) ;; the association list to store results

(with-open-file (stream (first *args*)) ;; open the file specified on cli
    (do ((line (read-line stream nil) 	;;; initialize LINE with with read-line
               (read-line stream nil))) ;;; the increment step (result gets stored in LINE)
        ((null line)) 			;;; termination condition (read-line returns NIL on EOF)

	;;; split the read string on a comma using the split-sequence function
	(setq tmp (split-sequence:SPLIT-SEQUENCE #\, line ))

	(setq key (intern (first tmp))) 	;; our key

	;; convert the last value to an int
	(setq newval (parse-integer (first (last tmp)) :junk-allowed t))

	;; check if key is in the assoc list; if it's not, VAL will be NIL
	(setq val (cdr (assoc key aggregate)))

	(if val
	  (rplacd (assoc key aggregate) (+ newval val)) 	;; if exist update old entry
	   (setq aggregate (acons key newval aggregate))	;; else add a new entry
	)
    )
    
    (print aggregate)
)

Please feel free to pick it apart, and make suggestions. I do not claim that this is the best way to do things – I just hacked it up for fun. I was surprised that CLISP does not have a built in function to split a string on a delimiter. It seems such a fundamental feature, and almost every modern language has one. Instead of writing my own, I actually used this one.

This is especially odd considering the very complex with-open-file function which actually opens and closes the file, and handles error checking in the background. This a very simple and elegant way of handling I/O. Makes you wonder why a split string function didn’t make the cut. :P

The output is not pretty but it works. It would take few more lines to format it the right way for import into excel but I got bored after that. SQL still wins on simplicity, and I’m pretty sure I could produce tighter and less verbose code in perl or python. But hey – I set out to make my life difficult by using lisp, and I did. This one actually made me hit the online manuals and documentation a bit to figure out how to do simple things the lisp way.

[tags]lisp, csv, comma separated values, lisp file operations[/tags]

This entry was posted in Uncategorized. Bookmark the permalink.

19 Responses to Lisp: Parse and Aggregate a CSV File

Starhawk says:

December 22, 2007 at 5:43 pm

Hey thanks this is something I can actually use :) And btw thanks also for the latex and e-macs stuff ya been posting lately.

A couple of months ago i down loaded and installed clisps (and slime) myself promising myself I would actually learn it and use it sometime in the winter because i don’t work much or even at all then. To actually and seriously program I need time off work.

I Will look at your code tomorrow if I don’t work (I never know when I’m going ta work or not gonna work, I just learn to deal with).

String manipulation is something I usually use alot in my code and I’m pretty good at it in langs like C so to use lisps effectively I need to get a grip on it. So your code will at least motivate me to think about it so thanks ;)

Reply | Quote
Luke Maciak says:

December 23, 2007 at 5:37 am

Thanks! I’m glad someone is reading that stuff.

My biggest issue with lisp is that the easily googleable documentation is usually fragmentary or confusing. There some good books out there, but they do not always deal with the problem at hand. :P

Reply | Quote
Jaba says:

December 23, 2007 at 5:07 pm

Using LET* would let you save lots of setq.
I’m glad you’re willing to think in lisp, so here’s my little tip:
http://cl-cookbook.sourceforge.net/
I suggest you the mentioned book “common lisp: a gentle introduction to symbolic computation”, because the real power of lisp is that it’s a FUNCTIONAL programming language. The code you did seemed ugly because you’re still thinking imperative – you used lots of setq because you wanted to manipulate variables. But it’s a good start.

I post you a function I was working on few days ago. I am coding genetic algorithms in lisp, and here’s the definition for “mutation” (no matter what it means):

For every element in a population, for every bit of a “binary string”-individual, there’s an as little as will Pm probability to mutate the bit instead of keeping it.

The imperative pseudocode would involve two for cycles, bounded by two variables, and arrays and memory allocation etc etc. Functional programming works differently. Here’s my one-line implementation:

(defun mutate (pop pm) (mapcar #'(lambda (ind) (mapcar #'(lambda (bit) (if (

Reply | Quote

Jaba

says:

December 23, 2007 at 5:07 pm

that can be formatted as

(defun mutate (pop pm)
  (mapcar #'(lambda (ind)
                  (mapcar #'(lambda (bit)
                                  (if (

Jaba says:

December 23, 2007 at 5:10 pm
(sorry for the ugly commenting, but I’m experiencing lots of problems. Please edit it at will)

that can be formatted as
```
(defun mutate (pop pm)
..(mapcar #'(lambda (ind)
..................(mapcar #'(lambda (bit)
..................................(if (
```
Reply | Quote
Jaba says:

December 23, 2007 at 5:12 pm
(the blog doesn’t seem to accept the LESS THAN symbol… another entry…)

(sorry for the ugly commenting, but I’m experiencing lots of problems. Please edit it at will)

that can be formatted as

(defun mutate (pop pm)(sorry for the ugly commenting, but I’m experiencing lots of problems. Please edit it at will)

that can be formatted as
```
(defun mutate (pop pm)
..(mapcar #'(lambda (ind)
..................(mapcar #'(lambda (bit)
..................................(if (
```
Reply | Quote
Jaba says:

December 23, 2007 at 5:13 pm

(wordpress: I hate you. I give up.)

I hope my message looks humble enough as I wish: I’m no genius in lisp, but lisp, even if equipotential, is very more expressive than lots of more common languages.

Best regards

-J

Reply | Quote
Luke Maciak says:

December 23, 2007 at 5:15 pm

Thanks for the tips Jabba. You are right, I should have used a let statement. And it’s true – this is still an imperative code, but I was having a hard time figuring out how to do the file I/O without a loop. :P

On, and btw, try putting your code in

<pre lang=”lisp”> </pre>

This will preserve whitespace and escape special characters.

Reply | Quote
Luke Maciak says:

December 23, 2007 at 5:25 pm

The less than character gets eaten by wordpress because it is a start of a HTML tag. The <pre lang=”lisp”> </pre> environment however is handled by my code highlighting plugin which will automatically escape the < character by converting it into < so you can just copy and paste code directly into the commend box. :)

Reply | Quote

Jaba

says:

December 23, 2007 at 5:40 pm

You’re too polite not to try it immediatly. I’m pasting a lisp implementation of a simple local search algorithm, that tries to find a number with as many “ones” as possible in his binary representation, with solution space 0 to n. Sorry for the documentation in Italian :P

PSEUDO-CODE:

init(start);
i:=istart;
repeat
  generate (j in Nh(i))
  if fit(j) >= fit(i) then i:=j;
until f(j)

WORKING LISP IMPLEMENTATION:
with much care on training on the “tail recursion” technic, that can generate code as fast as fortran does (look at “performance” on CLISP homepage)

(defun dec2bin (n)  ; ci serve per ottenere la rappresentazione binaria del max
  "Restituisce la rappresentazione binaria di N, come lista di bit"
  (dec2bin-acc n nil))

(defun dec2bin-acc (n acc)
  "Versione di dec2bin con accumulatore per tail recursion"
  (if (= (length lst) n) lst
    (expand-bin (cons 0 lst) n)))

(defun fitness (n)
  "Restituisce la fitness di un numero"
  (length (remove-if #'zerop n)) )

(defun generate (i top) ;; a partire da questa versione, top è un numero decimale
  "Restituisce il primo tra i migliori vicini di I, stando attento a non sforare TOP"
  (let ((pos (position 0 i :from-end t)))
    (if (null pos) i
      (let ((copy (copy-list i)))
        (setf (nth pos copy) 1)
        (if (= (fitness i) (fitness j)) )

(defun best (lst)
  "Data una lista di codifiche binarie, restituisce quella con fitness migliore"
  (car (sort lst #'best-fit-check)) )

(defun local-search (n)
  "Esegue una ricerca locale con spazio delle soluzioni tra 0 e N"
  (do* ((i     (init n)     (best (list i j)))
        (j     (generate i n) (generate i n))
        (i-dec (bin2dec i)  (bin2dec i))
        (fi    (fitness i)  (fitness i))         ;NOTA: solo per leggibilità, per ottimizzazione rimuoverle
        (fj    (fitness j)  (fitness j)))
       ((or (equal i j) (> fi fj))  ; ecco che vado ad assicurarmi dal loop sul massimo
        (format t #|"~&FINE:|#"~&Soluzione finale:  ~D = ~A - Fitness: ~D" i-dec i fi)
         i-dec)
       (format t "~&Passaggio attuale: ~D = ~A - Fitness: ~D" i-dec i fi) ))

Hope this can help, I spent some months to gain functional mentality – and it’s still work in progress.

Have fun with lisp

-J

Jaba

says:

December 23, 2007 at 5:42 pm

PSEUDO-CODE:

init(start);
i:=istart;
repeat
  generate (j in Nh(i))
  if fit(j) >= fit(i) then i:=j;
until f(j)

(defun dec2bin (n)  ; ci serve per ottenere la rappresentazione binaria del max
  "Restituisce la rappresentazione binaria di N, come lista di bit"
  (dec2bin-acc n nil))

(defun dec2bin-acc (n acc)
  "Versione di dec2bin con accumulatore per tail recursion"
  (if (= (length lst) n) lst
    (expand-bin (cons 0 lst) n)))

(defun fitness (n)
  "Restituisce la fitness di un numero"
  (length (remove-if #'zerop n)) )

(defun generate (i top) ;; a partire da questa versione, top è un numero decimale
  "Restituisce il primo tra i migliori vicini di I, stando attento a non sforare TOP"
  (let ((pos (position 0 i :from-end t)))
    (if (null pos) i
      (let ((copy (copy-list i)))
        (setf (nth pos copy) 1)
        (if (= (fitness i) (fitness j)) )

(defun best (lst)
  "Data una lista di codifiche binarie, restituisce quella con fitness migliore"
  (car (sort lst #'best-fit-check)) )

(defun local-search (n)
  "Esegue una ricerca locale con spazio delle soluzioni tra 0 e N"
  (do* ((i     (init n)     (best (list i j)))
        (j     (generate i n) (generate i n))
        (i-dec (bin2dec i)  (bin2dec i))
        (fi    (fitness i)  (fitness i))         ;NOTA: solo per leggibilità, per ottimizzazione rimuoverle
        (fj    (fitness j)  (fitness j)))
       ((or (equal i j) (> fi fj))  ; ecco che vado ad assicurarmi dal loop sul massimo
        (format t #|"~&FINE:|#"~&Soluzione finale:  ~D = ~A - Fitness: ~D" i-dec i fi)
         i-dec)
       (format t "~&Passaggio attuale: ~D = ~A - Fitness: ~D" i-dec i fi) ))

Hope this can help, I spent some months to gain functional mentality – and it’s still work in progress.

Have fun with lisp

-J

Jaba

says:

December 23, 2007 at 5:48 pm

(sorry, never posted before on wordpress, I’m seeing only now the “preview” button… this will never occour again, please edit my previous entry, thank you)

PSEUDO-CODE:

init(start);
i:=istart;
repeat
  generate (j in Nh(i))
  if fit(j) >= fit(i) then i:=j;
until f(j)


(defun dec2bin (n)  ; ci serve per ottenere la rappresentazione binaria del max
  "Restituisce la rappresentazione binaria di N, come lista di bit"
  (dec2bin-acc n nil))

(defun dec2bin-acc (n acc)
  "Versione di dec2bin con accumulatore per tail recursion"
  (if (= (length lst) n) lst
    (expand-bin (cons 0 lst) n)))

(defun fitness (n)
  "Restituisce la fitness di un numero"
  (length (remove-if #'zerop n)) )

(defun generate (i top) ;; a partire da questa versione, top è un numero decimale
  "Restituisce il primo tra i migliori vicini di I, stando attento a non sforare TOP"
  (let ((pos (position 0 i :from-end t)))
    (if (null pos) i
      (let ((copy (copy-list i)))
        (setf (nth pos copy) 1)
        (if (= (fitness i) (fitness j)) )

(defun best (lst)
  "Data una lista di codifiche binarie, restituisce quella con fitness migliore"
  (car (sort lst #'best-fit-check)) )

(defun local-search (n)
  "Esegue una ricerca locale con spazio delle soluzioni tra 0 e N"
  (do* ((i     (init n)     (best (list i j)))
        (j     (generate i n) (generate i n))
        (i-dec (bin2dec i)  (bin2dec i))
        (fi    (fitness i)  (fitness i))         ;NOTA: solo per leggibilità, per ottimizzazione rimuoverle
        (fj    (fitness j)  (fitness j)))
       ((or (equal i j) (> fi fj))  ; ecco che vado ad assicurarmi dal loop sul massimo
        (format t #|"~&FINE:|#"~&Soluzione finale:  ~D = ~A - Fitness: ~D" i-dec i fi)
         i-dec)
       (format t "~&Passaggio attuale: ~D = ~A - Fitness: ~D" i-dec i fi) ))

with much care on training on the “tail recursion” technic, that can generate code as fast as fortran does (look at “performance” on CLISP homepage)

Hope this can help, I spent some months to gain functional mentality – and it’s still work in progress.

Have fun with lisp

-J

Luke Maciak says:

December 23, 2007 at 10:56 pm

I fixed your entries. You missed a </pre> tag here and there. And yeah, the preview button is there for times like this. But no worries.

Thanks for the code samples! I wish the comments were in English :P

Reply | Quote

Luke Maciak

says:

December 23, 2007 at 11:14 pm

Here is a cleaner version with let… Still trying to figure out how to make it more lispy:

;;; get split-sequence from http://www.cliki.net/SPLIT-SEQUENCE
(load "split-sequence.lisp")
(setq aggregate '()) ;; the association list to store results

(with-open-file (stream (first *args*)) ;; open the file specified on cli
    (do ((line (read-line stream nil) 	;;; initialize with read-line
               (read-line stream nil))) ;;; step is read-line
        ((null line)) 			;;; termination condition
	
	;; split the output string on a comma using the split-sequence
	(setq tmp (split-sequence:SPLIT-SEQUENCE #\, line ) )

	(let (
		(key 	(intern (first tmp))					) 	;; our key
		(newval (parse-integer (first (last tmp)) :junk-allowed t)	)	;; convert the last value to an int
	     )
		   (setq val 	(cdr (assoc key aggregate))	)	;; check if key is in the assoc; if it's not, VAL will be NIL
		    
		   (if val
	  		(rplacd (assoc key aggregate) (+ newval val)) 	;; if exist update old entry
	   		(setq aggregate (acons key newval aggregate))	;; else add a new entry
		)
	)
    )
    (print aggregate)
)

Jaba

says:

December 24, 2007 at 3:46 am

I still haven’t tried your code (I’m packing for holydays, I’ve a fly for this evening!), but as requested I tried to make it “a little lispier” without changing what you did. It’s just a taste matter, in this case, but the result is really good. Some problems, you simply have to resolve in simil-imperative way. But one thing I really love of lisp is that you CAN do anything, even imperative, you just don’t HAVE to. Ever heard of CLOS, the Common Lisp Object System? :D
http://cl-cookbook.sourceforge.net/clos-tutorial/index.html
Yes, in lisp you CAN go object oriented (and LOOK AT THE TUTORIAL, it’s soooo simple…), but you don’t have to. Usually, you just need some struct, that comes with free get/set functions :D
Just try it, if you wish, and let me know ;) have fun!

-J


  ;;; CVS PARSING FUNCTION - EXECUTABLE FILE  ; tree ";" for big comments, two for full-line comments, one for in-line comments... just conventions, as *global-variable*...
(load "split-sequence.lisp")
(setf aggregate nil) ; setf is setq when needed, but it's more flexible in general
 (parse-csv)	; this way you call a normal function, that can be reused
 
 ;; I formatted it a little more lispy, as teached in "Common Lisp - A gentle introduction to symbolic computation"
 ;; It's esasier with spaces than with tabs. Suggested base tabulation is of 2 spaces
(defun parse-csv (&optional (filename (first *args*)))	; &optional says: if you give parse-csv an argument, it's the filename; otherwise, get it from (first *args)
  "Parses a CSV file and DOES THINGS"  ; in-line documentation. Try (documentation 'parse-csv 'function) :D
  (with-open-file (stream filename) 
    (do ((line (read-line stream nil) (read-line stream nil))) 
        ((null line) aggregate) ; returns the aggregate value in output - no need to print it
	    (let* (		; let* is equivalent to let but bounds its arguments one at a time
		    (tmp    (split-sequence:SPLIT-SEQUENCE #\, line )) 	; this can still be made in LET*
		    (key    (intern (first tmp)))
		    (newval (parse-integer (car (last tmp)) :junk-allowed t)) ; CAR is lispy :P
		    (val    (cdr (assoc key aggregate))) ) ; this can stay in LET* too
          (if val   (rplacd (assoc key aggregate) (+ newval val))
	   		        (setq aggregate (acons key newval aggregate))))))

ps I code under linux with clisp (sometimes cmucl), simply using kwrite as editor (no time to learn emacs still… and gedit doesn’t have lisp highlighting). Under windows, the best editor i ever used is NOTEPAD++ (google it, you won’t regret it)

Luke Maciak says:

December 24, 2007 at 3:56 am

Ah! Let* – that’s what I was looking for. Funny thing – I completely forgot that it exists. Thanks!

Re: editor – Vim all the way! I run it as my default text editor in every single OS. ;)

Reply | Quote
Jaba says:

December 24, 2007 at 4:52 am

:D intresting, I’ll try vim so, thanks!

Reply | Quote
Pingback: Lisp: Parse and Aggregate a CSV File [ Terminally Incoherent ]
Frank Shearar says:

November 10, 2009 at 7:27 am

It’s very common for CSV files to allow the comma as part of a field… as long as the field’s quoted. For instance:

1,2,”3,4″,5.

SPLIT-SEQUENCE will, I imagine, misparse the 3rd field and say that the line contains 5 fields when it only contains 4.

Further, quoted fields can themselves contain quotes, escaped as “”. So this is a legal CSV line too:

1,2″Then I was all like “”so what?”””

And the final thing is that quoted fields may contain linebreaks. (I must say, I’ve never seen this in the field!)

Reply | Quote

Lisp: Parse and Aggregate a CSV File

19 Responses to Lisp: Parse and Aggregate a CSV File

Leave a Reply Cancel reply

Subscribe

Social

Categories

Archives

Interesting Blogs

My Software Projects

People I Know

Web Comics I Read

Meta

Lisp: Parse and Aggregate a CSV File

19 Responses to Lisp: Parse and Aggregate a CSV File

Leave a Reply Cancel reply

Subscribe

Social

Categories

Tags

Archives

Interesting Blogs

My Software Projects

People I Know

Web Comics I Read

Meta