Revisiting Perl and Python's Speed
I was really surprised to see the discussion that was generated as the result of my previous post comparing the speed of Python and Perl. Many people much wiser than me posted a lot of valuable comments and suggestions, and two people were kind enough to post total rewrites of my routines which (to nobody's surprise) were much faster than the codes I wrote.
A few people (both here on the blog and through other discussion) raised legitimate points:
- My Python code was recompiling the regex every loop iteration because I was confused by how regex compilation and regex match objects work. Fixing this problem alone increased speed by 10%-25%.
- The timings I posted were sub-second and someone suggested that startup overhead may have been hurting Python. To address this, I used a more "real-life" input file that was 3,750 MB rather than the 8.588 MB input file I used earlier.
- The style of Perl I was using was archaic, and the style of Python I was using wasn't terribly Pythonic. I live in a programming bubble; I learned both of these languages from their respective O'Reilly books and that's it. I don't know anyone who knows either Perl or Python in real life, and I have never seen anyone else's code in either language. But as it turns out, poorly written Perl and poorly written Python follow the same trends as well-written Perl and Python (see below).
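To illustrate the first point above, here is a minimal sketch of the regex fix: compile the pattern once, outside the loop, instead of on every iteration. The pattern, function names, and sample lines are hypothetical stand-ins, not the ones from my actual code. Note that Python's `re` module keeps a cache of recently compiled patterns, so the per-iteration cost is mostly a function call and a cache lookup rather than a full recompile, but it still adds up over millions of lines.

```python
import re

# Hypothetical pattern, for illustration only (not the one from the real script)
PATTERN = r"^(\w+)\s+(\d+)"


def count_matches_slow(lines):
    """Calls re.compile on every iteration of the loop."""
    hits = 0
    for line in lines:
        rx = re.compile(PATTERN)  # repeated call and cache lookup per line
        if rx.match(line):
            hits += 1
    return hits


def count_matches_fast(lines):
    """Compiles the pattern once and reuses the compiled object."""
    rx = re.compile(PATTERN)  # hoisted out of the loop
    hits = 0
    for line in lines:
        if rx.match(line):
            hits += 1
    return hits
```

Both functions return the same count; only the amount of work done per line differs.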
- Software
  - Ubuntu Server 10.04 LTS
  - Python 2.6.5 provided by the distribution
  - Perl 5.10.1 provided by the distribution
  - data resides on an ext4 LVM
- Hardware
  - HP DL360 G7
  - 2x Xeon X5672, 3200 MHz
  - 24GB DDR3 RAM
  - data resides on 6Gbit SAS RAID5
- Codes
  - "Old Perl" code is the code shown in my previous post.
  - "Old Python" code is also shown in my previous post.
  - "New Perl" code is the code written by gnustavo.
  - "New Python" code is the code written by Paul Davis.
Methodology: I ran each code on the same 3750 MB input file five times in succession. Each execution was timed using the `time` builtin provided by bash 4.1.5(1), which ships with Ubuntu 10.04. stdout was redirected straight to /dev/null.
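For anyone who wants to repeat this kind of measurement without bash's `time`, the procedure can be roughly sketched in Python. This is not what I actually ran: the function name `time_runs` and the file names in the comment are made up, it assumes Python 3 (for `subprocess.run` and `subprocess.DEVNULL`), and it only captures wall-clock time, including the overhead of spawning the process, rather than the real/user/sys breakdown that `time` reports.

```python
import subprocess
import sys
import time


def time_runs(cmd, trials=5):
    """Run cmd `trials` times back to back; return each wall-clock time in seconds."""
    walltimes = []
    for _ in range(trials):
        start = time.perf_counter()
        # Discard stdout, like redirecting to /dev/null in the shell
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
        walltimes.append(time.perf_counter() - start)
    return walltimes


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 time_runs.py perl new_perl.pl input.dat
    results = time_runs(sys.argv[1:])
    for i, seconds in enumerate(results, start=1):
        print("Trial %d: %.3f s" % (i, seconds))
    print("Mean: %.3f s" % (sum(results) / len(results)))
```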
| Code | Mean walltime (s) | Trial 1 (s) | Trial 2 (s) | Trial 3 (s) | Trial 4 (s) | Trial 5 (s) |
|---|---|---|---|---|---|---|
| Old Python | 309.032 | 310.971 | 308.228 | 311.331 | 307.170 | 307.461 |
| New Python | 176.880 | 178.099 | 174.742 | 175.463 | 178.235 | 177.863 |
| Old Perl | 167.051 | 166.916 | 165.911 | 167.361 | 168.735 | 166.333 |
| New Perl | 126.860 | 125.913 | 124.709 | 130.125 | 127.809 | 125.746 |
So even the cleaner code runs roughly 40% faster in Perl than in Python (a mean walltime of 126.9 s versus 176.9 s), which is not far off from the 50% slowdown I noted with my two crummier versions of the code. Furthermore, it seems easier for a relative novice like myself to write inefficient Python code than inefficient Perl code. Of course, it's also easier to write Perl code that doesn't do what you expect, and trying to understand someone else's Perl code is a crapshoot.
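The comparison can be checked directly from the trial times in the table. The short script below is just a sketch of that arithmetic; the variable names are my own.

```python
# Trial walltimes copied from the table above
trials = {
    "Old Python": [310.971, 308.228, 311.331, 307.170, 307.461],
    "New Python": [178.099, 174.742, 175.463, 178.235, 177.863],
    "Old Perl": [166.916, 165.911, 167.361, 168.735, 166.333],
    "New Perl": [125.913, 124.709, 130.125, 127.809, 125.746],
}

# Mean walltime per code
means = {name: sum(times) / len(times) for name, times in trials.items()}

# How many times longer Python takes than Perl for each generation of code
ratio_old = means["Old Python"] / means["Old Perl"]
ratio_new = means["New Python"] / means["New Perl"]

for name, mean in means.items():
    print("%-10s mean walltime: %.3f" % (name, mean))
print("Old codes: Python takes %.2fx as long as Perl" % ratio_old)
print("New codes: Python takes %.2fx as long as Perl" % ratio_new)
```

With these numbers the new Python code takes about 1.39 times as long as the new Perl code, and the old Python code about 1.85 times as long as the old Perl code.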
Judging by what others have told me and what some commenters have pointed out, though, Python just isn't optimized for "practical extraction and reporting." Maybe someday I'll find a use for Python in my work.
In case the links to the codes I used ever go bad, here they are on pastebin:
I'd post the input files I used, but I don't have anywhere I can host 3.7 GB files. If you're interested in the input data, let me know and I can send a private link.