Recursive Word Count

People always ask me how many lines of code I wrote for my thesis research project. I really don’t keep track of details like that – I just concentrate on accomplishing the goals and getting results. But I figured I should find out, so that I can just blurt out a number when someone asks and be done with it. Then I realized that wc does not recurse into subdirectories.

My code is neatly divided into packages, sub-packages and so on, so it took me a minute or two to figure out how to run a recursive word count. Here it is:

wc -l `find . -name '*.java'`

This will print out the name of each file along with the line count, and then give you a grand total on the last line.
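
One caveat: the backticks split find's output on whitespace, so a path containing a space reaches wc as two separate arguments. Here is a minimal sketch of one way around that, assuming a find that supports the POSIX -exec ... {} + form. The helper name java_line_counts is made up for this example.

```shell
# java_line_counts DIR: per-file line counts for every *.java file under DIR,
# followed by a "total" line when more than one file matches.
# (java_line_counts is a hypothetical helper name, not a standard command.)
java_line_counts() {
    # -exec ... {} + hands the file names directly to wc, without going
    # through the shell's word splitting, so names with spaces stay intact
    find "$1" -type f -name '*.java' -exec wc -l {} +
}
```

Running java_line_counts . from the project root should print the same kind of listing as the one-liner. With a very large number of files, find may invoke wc more than once, in which case there will be more than one total line.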

If anyone is curious the number of lines is 5284 as of today. It should be more because I haven’t implemented all the features that I wanted yet (not sure if I ever will but, you know…). The longest file has 801 lines, and the shortest one has 49. Now stop asking. :mrgreen:

Update 05/29/2007 04:21:42 PM

If you want to exclude a directory from the listing, use -prune:

wc -l `find . -wholename './lib/*' -prune -o -name '*.java' -print`

This will count the lines in all the files except those in ./lib/
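
Note that -wholename is a GNU find spelling; -path is the more portable name for the same test. A rough sketch of the same exclusion wrapped in a function, assuming the directory to skip sits directly under the starting directory. The name java_lines_excluding is invented for this example.

```shell
# java_lines_excluding DIR SKIP: line counts for *.java files under DIR,
# ignoring everything inside DIR/SKIP.
# (java_lines_excluding is a hypothetical helper name.)
java_lines_excluding() {
    # Pruning the directory itself keeps find from descending into it;
    # everything else falls through to the -name test after -o
    find "$1" -path "$1/$2" -prune -o -type f -name '*.java' -exec wc -l {} +
}
```

For the layout described above, the call would be java_lines_excluding . lib.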




9 Responses to Recursive Word Count

  1. Craig Betts says:

    Nested loops, maybe, but not recursion. You need a process/subroutine to call itself. Kinda like calculating factorials.

    Try this one for size . . .

    for file in `find . -name \*.java -type f`; do \
    cat "$file"; done | wc -l

    This will count the total number of lines in all the *.java files, not just how many *.java files there are.

  2. Luke says:

    I think mine did the same thing – the find command in back ticks returns a list of all the java files. The wc command can accept multiple arguments and then calculates the number of lines for every one of them.

    Yours will also work, but I think the cat step will make it a tad slower.

    Oh, and by recursive I kinda meant “recurse into subdirectories”. :mrgreen:

  3. Craig Betts says:

    Really!

    *reads man page for wc*

    Doy! I keep thinking that only the shell can resolve multiple files via wildcards, as opposed to the output of the command in the back ticks.

    Maybe I just like showing off running for loops in a command line . . .

  4. Luke says:

    I do this too all the time. But the best advice I got regarding writing “good” shell scripts was “don’t pipe the cat unless you absolutely have to” :mrgreen:

    Some unix commands will only work with stdin, but many are able to access files on their own. Grep is probably most notorious for this:

    1. grep foo bar.txt # fast
    2. cat bar.txt | grep foo # slow

    Many people will use #2 by habit, when #1 is much faster and actually easier to type. I remember this because we always used to joke about it:

    “Stop grepping the cat you pervert!”
    “Leave the poor cat alone dude!”
    “Did you just pipe the cat? I’m gonna call animal services!”

    So yeah – every time I see cat in a script I start looking for an alternate solution to rescue the poor kitty from the pipe abuse. :)
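
The two grep forms above can be compared side by side. This is only an illustrative sketch: the scratch file and the variable names sample, direct and piped are made up here, and grep -c is used so that each form yields a single number.

```shell
# Scratch file standing in for bar.txt from the example above.
sample=$(mktemp)
printf 'foo\nbar\nfood\n' > "$sample"

# Form 1: grep opens the file itself (one process, no pipe)
direct=$(grep -c foo "$sample")

# Form 2: cat reads the file and feeds it to grep through a pipe
# (same answer, one extra process)
piped=$(cat "$sample" | grep -c foo)

echo "direct=$direct piped=$piped"
rm -f "$sample"
```

Both variables end up holding the same count; the only difference is the extra cat process in the second form.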

    Btw, I always forget the syntax of the bash loops and have to look it up. It also doesn’t help that I spent a lot of time in tcsh on one of the unix stations at school – and tcsh loops are wildly different from bash loops.

  5. Wikke says:

    All the cat jokes aside :P
    Isn’t line counting a poor guide to your progress?
    I mean, it only indicates how many lines you can break your code into.
    Good if you’re paid by the line, but otherwise I think it’s useless.
    Especially with Java and the like, where you can basically type your entire program on one line or put every word on its own line.

  6. Luke says:

    You are correct, Wikke. Line count is one of those metrics that doesn’t really mean much, and yet people keep attaching significance to it. Even when you program normally, line count can vary. For example, I use the BSD/Allman indent style and would write:

    for(int i=0; i<a.length; i++)
    {
        sum += i;
    }

    On the other hand someone using the K&R style would write the same snippet of code like this:

    for(int i=0; i<a.length; i++) {
        sum += i;
    }

    This means that I gain 1 line of code for each block that requires a brace. Every loop, every if statement, every try statement, every method and class declaration will always be 1 line longer for me. The bigger the project, the more this adds to my line count.

    Our code can be semantically equivalent, and it can compile to the same exact set of assembler instructions. But my line count will be higher simply because of a visual formatting choice that I made.
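
One way to take brace placement out of the count is to skip blank lines and lines that hold nothing but a single brace. This is a rough sketch of that idea, not a standard metric, and the helper name code_lines is invented for the example.

```shell
# code_lines FILE: count the lines in FILE that contain more than
# whitespace or a lone { or }.
# (code_lines is a hypothetical helper name, not a standard tool.)
code_lines() {
    # -v inverts the match and -c counts: the pattern matches blank lines
    # and lines consisting of a single brace, so those are left out
    grep -cvE '^[[:space:]]*[{}]?[[:space:]]*$' "$1"
}
```

Applied to the two snippets above, the Allman version (4 physical lines) and the K&R version (3 physical lines) both come out at 2.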

    This makes for a very poor metric. Of course, so is counting methods, classes, variables and so on. There are just no very good ways to measure progress and performance other than accomplishing set goals by the desired deadlines.

    That’s why at the beginning of this post I said that I never really cared how many lines my code had – because it didn’t really mean anything. But when I tell people about the project they keep asking me about the line count.

  7. Craig Betts says:

    I myself actually live on a tcsh command line, but I script in bourne. Sick? Demented? Most likely. I started in UNIX land as a programmer, so tcsh was a logical choice. As an admin though, well . . . tcsh is a really poor choice (can’t redirect both standard-in and standard-out at the same time :-() Not to mention all the exploits tcsh leaves open. But I like the feel of the tcsh command line.

    As far as “grepping the cat”, when you are running a Sun V1280 with twelve 1.2 GHz SPARC III cu processors and 80 GB of RAM, those two microseconds you save are not really going to be of value later . . . :-D

    If resources are that scarce, you really need to go back to C and drop Java . . . ;-)

  8. Luke says:

    Well, the 12 CPUs are not really going to amount to any speedup – the whole script will likely run on a single CPU because none of the core unix tools is multi-threaded.

    Btw, I totally want that machine! All I have here to test my multi-threaded apps is an ancient Sun-Fire-880 with four 750MHz CPUs and just a bit over 8GB of RAM. :(

    But you’re right – with a fast CPU it doesn’t really matter if you grep the cat or not. Still, even with unlimited resources, grepping the file directly instead of grepping the cat is less typing. :)

  9. Michele Antolini says:

    Under OSX:

    wc -l `find * -name "*.*"`
    or
    wc -l `find * -name "*.java" -or -name "*.html"`

    i.e. just put double quotes ( " ) around the filter expression after -name
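
The quotes matter on any system, not only OSX: without them the shell expands the pattern against the current directory before find ever sees it. Here is a small sketch of the difference, run in a throwaway directory. The file names and the variables demo, unquoted and quoted are invented for the illustration.

```shell
# Scratch tree: one .java file at the top level, one in a subdirectory.
demo=$(mktemp -d)
mkdir "$demo/sub"
touch "$demo/Top.java" "$demo/sub/Nested.java"

# Unquoted: the shell rewrites *.java to Top.java before find runs,
# so find only looks for files named Top.java
unquoted=$(cd "$demo" && find . -name *.java | wc -l | tr -d ' ')

# Quoted: find receives the pattern itself and matches in every directory
quoted=$(cd "$demo" && find . -name '*.java' | wc -l | tr -d ' ')

echo "unquoted=$unquoted quoted=$quoted"
rm -rf "$demo"
```

The unquoted form finds only the top-level file, while the quoted form finds both. With two or more matching files in the current directory, the unquoted form usually fails outright with an error from find.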
