Python Basic Datastructure Creation Time

If there’s one thing that really grinds my gears, it’s when people use lists instead of tuples to define what is intended to be an immutable constant. Why? Because it’s faster to create a tuple. Here’s a case as an example:

class SomeClass(object):
    AVAILABLE_KEYS = [
         'key1',
         'key2',
         'key3',
         'key4',
    ]
    def __init__(self):
         pass

Clearly, as a class attribute, the initial developer has intended that SomeClass.AVAILABLE_KEYS is some sort of constant. If it were me seeing this, it would bother me that we could easily make the creation of this list microseconds faster with a minor change. Just to see how correct I am, I set out to do a basic experiment to capture just how long it takes to create common datastrucutres in python. So I hacked together a simple script:

from subprocess import Popen, PIPE

STEP_SIZE = 1000
SAMPLE_SIZES = [n + 1 for n in xrange(0, 11000 * 100000, STEP_SIZE)]


def get_time_from_output(output_str):
    '''
    Input str: "1000000 loops, best of 3: 1.31 usec per loop"
    '''
    print output_str
    usec_time = float(output_str.split()[5])
    if "msec" in output_str:
        usec_time *= 1000.0
    return usec_time


tuple_results = []
list_results = []
set_results = []
dict_results = []
for numbers_to_process in SAMPLE_SIZES:
    try:
        print "Attempting a sample size of %s" % numbers_to_process

        create_tuple = "junk = (%s)" % ",".join([str(j) for j in xrange(numbers_to_process)])
        create_list = "junk = [%s]" % ",".join([str(j) for j in xrange(numbers_to_process)])
        create_set = "junk = {%s}" % ",".join([str(j) for j in xrange(numbers_to_process)])
        create_dict = "junk = {v:v for v in xrange(%s)}" % numbers_to_process

        process = Popen(["python", "-m", "timeit", create_tuple], stdout=PIPE)
        output, err = process.communicate()
        tuple_results.append(get_time_from_output(output))

        process = Popen(["python", "-m", "timeit", create_list], stdout=PIPE)
        output, err = process.communicate()
        list_results.append(get_time_from_output(output))

        process = Popen(["python", "-m", "timeit", create_set], stdout=PIPE)
        output, err = process.communicate()
        set_results.append(get_time_from_output(output))

        process = Popen(["python", "-m", "timeit", create_dict], stdout=PIPE)
        output, err = process.communicate()
        dict_results.append(get_time_from_output(output))
    except OSError:
        print "Argument too long...aborting"
        break

import pylab
pylab.figure()
pylab.plot(tuple_results, color='r', label='Tuple Creation Time')
pylab.plot(list_results, color='b', label='List Creation Time')
pylab.plot(set_results, color='g', label='Set Creation Time')
pylab.plot(dict_results, color=(0.0, 0.0, 0.0), label='Dictionary Creation Time')
pylab.legend()
pylab.xlabel('Number of elements * %s' % STEP_SIZE)
pylab.ylabel('Time in uSec')
pylab.show()

…and now we can find out just how long it takes to compare the creation time of lists, tuples, sets, and dictionaries. I threw in sets and dictionaries just because I had already started the experiment, but to begin with my initial qualm, here’s the comparison of time between lists and tuples given a number of elements:


Screen Shot 2014-09-11 at 9.20.19 AM

How Right Am I?

Clearly, there’s an extra 6 to 8 microseconds of overhead associated with lists. This means that in a fairly optimal request/response to and from a web server of 200 ms, we’re increasing the response time by about 0.0034% totally unnecessarily! So, I’m totally right. Sort of.

Time for Tuples / List / Sets / Dictionaries


Screen Shot 2014-09-11 at 7.53.32 AM

The results for other common python datastructures are pictured above. Obviously, creation time is O(n) for every type, but I was actually fairly surprised to see how much longer a set or a dictionary took to create relative to a list/tuple. For me, this changes my perspective a bit for statically defined constants. Even though we’re only talking about microseconds here, it’s still a matter of principal for me. The question for me is essentially what should be favored between smaller time complexity and datastructure overhead if we have a fixed number of elements.
For example, consider the following 2 list comprehensions:

[item for item in fixed_list if item in SOME_DEFINED_TUPLE]
[item for item in fixed_list if item in set(SOME_DEFINED_TUPLE)]

Which is faster? Does it really even matter? In the second case, the number of iterations across the entire list comprehension will be much smaller by some kind of quadratic factor, but does that make anything faster when casting a tuple to a set incurs significant (microseconds) overhead?

Clearly I’m splitting hairs here, but all in all, this is at the very least insightful knowledge that could easily be shared at a cocktail party, and in sharing it you would get an insurmountable amount of high fives.