On The Implementation of Float and List Types in Python
Edit 2022-04-16: fix grammar
An edited version of this post has been published in two parts on heurekadevs.com: first part, second part.
How are the `float` and `list` data types represented in Python? Why do they take up so much memory? Why are some operations on them so slow? And why do so many people turn to third-party libraries, such as `numpy`?
In this article, we take a closer look at the implementation of two Python data types: `float` and `list`. We will see that the flexibility of a dynamically typed language like Python comes with drawbacks in the form of added complexity and bigger memory requirements.
Throughout the post, I make a few simplifications (e.g., regarding the interpreter's reuse of objects) to keep it brief. However, there should not be much loss of generality when we think of arbitrarily large lists and random floats.
Also, I describe these structures as they are implemented in CPython (specifically CPython 3.10.0).
PyObject
Let me start by saying that all data types in Python are objects. There are no primitive types.
On a lower level, every object “inherits” from the `PyObject` struct. More precisely, because CPython is written in C, the inheritance is implemented by holding an instance of the predecessor struct as the first member. This makes it possible to cast a pointer to any object to a pointer to `PyObject` (just as, in C++, a pointer to a descendant can be cast to a pointer to a predecessor). This casting possibility is an important feature, as we will see later.
The `PyObject` struct contains a reference counter and a pointer to a type:
```c
typedef struct _object {
    _PyObject_HEAD_EXTRA    /* extra object-tracking pointers (see below) */
    Py_ssize_t ob_refcnt;   /* reference counter */
    PyTypeObject *ob_type;  /* pointer to the object's type */
} PyObject;
```
where `_PyObject_HEAD_EXTRA` is a macro creating two pointers that link all live objects into a doubly-linked list (it expands to nothing in release builds). The garbage collector maintains a similar doubly-linked list of objects, stored in a `PyGC_Head` header placed just before each tracked object, to be able to handle reference cycles:
```c
typedef struct {
    // Pointer to the next object in the list.
    // 0 means the object is not tracked.
    uintptr_t _gc_next;

    // Pointer to the previous object in the list.
    // Lowest two bits are used for flags documented later.
    uintptr_t _gc_prev;
} PyGC_Head;
```
These pointers are created for, among others, `list`s, some `tuple`s, some `dict`s, and user-defined classes, but not for types that cannot hold references to themselves (e.g., `int`, `float`, …), because those can be collected using simple reference counting.
There is brilliant documentation of Python’s garbage collector.
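We can observe this distinction from Python itself: the `gc` module exposes `gc.is_tracked()`, which reports whether an object is linked into the collector's list. A small demonstration (outputs as documented for CPython; details may vary between versions):

```python
import gc

print(gc.is_tracked(1.0))        # False: a float can never be part of a reference cycle
print(gc.is_tracked([]))         # True: a list can (it may even contain itself)
print(gc.is_tracked({}))         # False: this dict holds no references to containers yet
print(gc.is_tracked({"a": []}))  # True: this dict references a container
```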
PyFloatObject
The `float` object contains an instance of `PyObject` and the value itself, which is actually a C `double` (a 64-bit floating-point number), which may be slightly confusing for a developer used to a C-family language. As a result, a `float` object uses 24 bytes of memory on a 64-bit system (8 B for the pointer to the object's type, 8 B for the reference counter, and 8 B for the value itself).
The implementation of `PyFloatObject`:

```c
typedef struct {
    PyObject_HEAD      /* expands to a PyObject */
    double ob_fval;    /* the value itself */
} PyFloatObject;
```

where `PyObject_HEAD` is just a macro for creating a `PyObject`.
Let us check the size of a float:

```python
import sys

a = 1.0
print(sys.getsizeof(a))
# result: 24
```
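To put the 24 bytes in perspective: the value alone fits into 8 bytes, as the standard-library `struct` module can confirm. A minimal sketch:

```python
import struct
import sys

a = 1.0
print(sys.getsizeof(a))          # 24: the full Python object
print(len(struct.pack('d', a)))  # 8: just the raw 64-bit double
```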
Intermission about Cache
The main memory in a computer is much slower for common operations than what would be optimal for the CPU. To speed up computation, modern computers use a faster and smaller memory called a cache (or, more commonly, several caches of different speeds and sizes). Reading from a cache is about 10-100 times faster than reading from the main memory.
When an address in memory is accessed, the CPU loads a whole block of bytes into the cache (much more than needed, usually). If the required block is already present in the cache, the main memory does not need to be accessed and the computation is faster. If, however, it is not present (so called cache miss), the CPU needs to load it, which is slow. We want to reduce the number of misses as much as possible.
There are two classes of algorithms which take caches into account: cache-aware and cache-oblivious. Cache-aware algorithms assume or know the parameters of the cache, such as the number of caches, the size of a block, the number of blocks that fit into a cache, and so on. This is often infeasible, as most programs run on varied hardware. Cache-oblivious algorithms do not know or assume (almost) anything about caches; they just perform faster when a cache is present.
An example of a simple cache-oblivious algorithm is a sequential scan of an array (a general C-like array, not the Python `array` module): when an element is accessed, a block of bytes surrounding it is loaded into the cache. When the next element is accessed, it is most likely already present in the cache (assuming some common block eviction strategy, such as least-recently-used). The number of cache misses is minimal (optimal) for a sequential array scan.
On the other hand, scanning a linked list is inefficient with respect to caches. When one element is read and a block of surrounding bytes is loaded into the cache, no other element of the list is likely to already be present in the cache (assuming the list is arbitrarily long and the cache size is constant). Hence, reading almost every item causes a cache miss.
If the cache is about n times faster than the main memory, such a scan is about n times faster on an array than on a linked list (this is simplified; other factors influence the performance).
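We can get a rough feel for this effect even in pure Python by scanning the same buffer sequentially and in random order. This is only an illustrative sketch: interpreter overhead dominates, so the measured gap understates the raw cache effect, and the exact timings depend on the machine.

```python
import array
import random
import time

n = 10_000_000
data = array.array('d', (random.random() for _ in range(n)))

sequential = list(range(n))
shuffled = sequential[:]
random.shuffle(shuffled)

def scan(indices):
    """Sum data[i] for each index, returning the elapsed time."""
    start = time.perf_counter()
    total = 0.0
    for i in indices:
        total += data[i]
    return time.perf_counter() - start

print(f"sequential scan: {scan(sequential):.2f} s")
print(f"random scan:     {scan(shuffled):.2f} s")  # slower, largely due to cache misses
```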
For a deeper understanding of cache-oblivious algorithms, see the “Cache” chapter in Data Stuctures lecture notes by Mareš or Demaine 2002.
Back to Python…
PyListObject
As a `list` in Python is able to hold objects of different types, with no need to specify the types in advance (as there would be in a statically typed language like C or C++), you are probably guessing there has to be some overhead for this convenience. And overhead there is, in both memory requirements and the time complexity of some common `list` operations (as we will see later).
In order to fit objects of various, previously unknown types, a `list` does not store the objects directly; it only keeps a pointer to an array of pointers (of type `PyObject *`) to the objects it holds. Keeping a pointer to an array, instead of storing the array directly in its own memory space, allows the list to have a variable length. And, as we previously established, a pointer to any Python object can be cast to a pointer to `PyObject`, meaning any mix of objects can be stored in a list.
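A quick way to see that a list stores only pointers: the reported size of the list itself does not depend on how big its elements are.

```python
import sys

small = [1.0]
big = ["x" * 1_000_000]
# both lists occupy the same space themselves: one pointer slot each
print(sys.getsizeof(small), sys.getsizeof(big))  # e.g., 64 64
```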
On top of the array of pointers, it holds a `PyVarObject` (a base struct for variable-length containers in Python), which, apart from a `PyObject`, also contains a counter for the number of elements stored. Because it is a variable-length array, it over-allocates (to be explained later), so it also holds the number of item slots currently allocated. And because a list can “contain” (hold a reference to) itself, it also has the two garbage-collection pointers.
The structure then looks like this:
```c
typedef struct {
    PyObject_VAR_HEAD       /* a PyObject plus ob_size, the number of stored items */
    PyObject **ob_item;     /* pointer to the array of item pointers */
    Py_ssize_t allocated;   /* number of item slots currently allocated */
} PyListObject;
```
where `PyObject_VAR_HEAD` is a macro for creating a `PyVarObject`.
Let us check the size of an empty list:

```python
import sys

a = []
print(sys.getsizeof(a))
# result: 56
```
Beware that this is architecture-specific, version-specific, and most likely OS-specific, and the size (or its calculation) might slightly differ. I have tested it on x86 macOS.
The size of an empty list is not an issue if there are a lot of elements in it. But if, for example, we want to imitate a multidimensional array using a list of lists of … lists, then the per-list overhead can become significant.
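To illustrate, here is a sketch measuring a 100,000 × 10 “matrix” built from lists (numbers from CPython 3.10 on a 64-bit system; they may differ elsewhere). Each inner list adds its own 56-byte header on top of its pointer array:

```python
import sys

matrix = [[0.0] * 10 for _ in range(100_000)]

outer = sys.getsizeof(matrix)                      # the outer pointer array
inner = sum(sys.getsizeof(row) for row in matrix)  # headers + pointer arrays of the rows

print(outer)  # ~0.8 MB
print(inner)  # ~13.6 MB, of which ~5.6 MB is just the 100,000 list headers
```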
Operations on List
First, a list needs to be created. If the number of elements to be stored in it is not known in advance, one usually creates an empty list and then appends objects to it. However, if the number of items is known, it is faster (and more memory-efficient) to create the list with the exact required size.
This brings us to the `append` method.
append
Readers familiar with “flexible” (dynamic) arrays should feel free to skip the next few paragraphs.
Let us imagine we want an array/list, but we do not know the number of items that will be added to it at runtime. If the allocated space of the array/list accommodated exactly the number of items currently in it, it would have to be reallocated with every added item (resulting in O(n) complexity per append, where n is the current number of items).
To avoid a reallocation preceding every `append`, we always allocate a little more space than necessary. If the allocated space is large enough (a multiple of the number of items), the amortized time complexity of `append` can be brought down to O(1). Amortized, in this context, basically denotes the average time complexity of an `append` in a series of appends. The worst-case complexity of a single `append` can be worse than the amortized complexity (that happens when the reallocation is performed).
Similar claims also hold for the `remove` operation and array/list shrinking. Note: while the array only grows, the amortized complexity is the same as the average complexity, but once removal of elements and “shrinking” of the array are allowed, there is a difference; I won’t go into that here. See Mareš’s book for a description of amortized analysis.
What is interesting in CPython is how the `list` is actually grown. Python increases the space for the `list` by a factor of approximately 1.125.
The newly allocated number of slots after a reallocation:

```c
new_allocated = ((size_t)newsize + (newsize >> 3) + 6) & ~(size_t)3;
```
where `newsize` is the number of items in the `list` (after appending the last item). It basically says that the newly allocated size is `newsize * 9/8 + 6`, but with the last two bits set to 0 (making the final number divisible by 4).
Note: the factor is a little bit different for very small lists.
Why is the factor so small and computed in such a way? I would be happy to know. In C++, the growth factor of `std::vector` in GCC is 2, and in the Microsoft compiler it is reportedly 1.5. I suspect that since objects in Python are relatively large, the interpreter does not want to waste too much space.
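We can watch the over-allocation from Python by observing when `sys.getsizeof` of a growing list jumps. A sketch (the exact capacities are an implementation detail of CPython 3.10 and may change):

```python
import sys

foo = []
last = sys.getsizeof(foo)
for _ in range(64):
    foo.append(None)
    size = sys.getsizeof(foo)
    if size != last:  # a reallocation happened
        print(f"len={len(foo)}, size={size} B")
        last = size
# capacities grow as 4, 8, 16, 24, 32, 40, 52, 64, ...
```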
Clearly, it is better to allocate the desired space beforehand than to perform a series of appends. Let’s see some examples of operations on `list`.
Examples
First, let us compare two methods of `list` initialization. The slow option:

```python
foo = []
for _ in range(1_000_000):  # finishes in 82.1 ms
    foo.append(1.0)
```
A faster option:

```python
foo = [1.0 for _ in range(1_000_000)]  # finishes in 53.3 ms
```
And a much faster, but not really practical, option, for it just repeats the one element:

```python
foo = [1.0] * 1_000_000  # finishes in 2.81 ms
```
Note: beware that this construction works differently for mutable and immutable types, as the sketch below shows.
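A minimal demonstration of the pitfall with a mutable element type:

```python
rows = [[]] * 3                # three references to the SAME inner list
rows[0].append(1.0)
print(rows)                    # [[1.0], [1.0], [1.0]]

rows = [[] for _ in range(3)]  # three distinct inner lists
rows[0].append(1.0)
print(rows)                    # [[1.0], [], []]
```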
Now let’s look at the size of one element in a `list` vs in an `array.array`, and the time needed to scan them.
```python
import array
import random
import sys

foo_list = [random.random() for _ in range(1_000_000)]
# using 'd' == double
foo_array = array.array('d', (random.random() for _ in range(1_000_000)))

# we need to add the size of every element, as they are not stored
# in the list itself, only referenced
print((sys.getsizeof(foo_list)
       + sum(sys.getsizeof(i) for i in foo_list)) / len(foo_list))
# returns: 32.448728

print(sys.getsizeof(foo_array) / len(foo_array))
# returns: 8.1838
```
Note that we do not call `sys.getsizeof` on every element of the `array`. That is because they are already counted in the overall size of the `array`, for they are stored directly in the structure.
Indeed, if we reduce the precision of the decimal numbers (from 64-bit to 32-bit, `'d'` to `'f'`), we get half the size:

```python
foo_array_f = array.array('f', (random.random() for _ in range(1_000_000)))
print(sys.getsizeof(foo_array_f) / len(foo_array_f))
# returns: 4.000064
```
As we can see, items are stored much more efficiently in an `array` than in a `list`.
Now, let’s compare the scanning speed of `list` vs `array.array`:

```python
sum_list = 0.0
sum_array = 0.0

for i in foo_list:   # finishes in 35 ms
    sum_list += i

for i in foo_array:  # finishes in 43.7 ms
    sum_array += i
```
Why is `array` slower? Even though the elements of an `array` are stored efficiently in memory, a new `float` object has to be created for each of them when it is read (I am simplifying a little, as there are some optimizations), making the whole scan slower than a scan of a `list`, which just returns references to the existing `float` objects.
Note that this can be sped up by using the `sum()` function instead of the loop (to 5.74 ms for `list` and 14.3 ms for `array`).
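For completeness, the `sum()` variant of the measurement above:

```python
sum_list = sum(foo_list)    # finishes in 5.74 ms
sum_array = sum(foo_array)  # finishes in 14.3 ms
```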
And just as an interesting side note: in `numpy`, using the `numpy.sum()` function, the sum takes 0.5 ms.
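A minimal sketch of that `numpy` variant, assuming `numpy` is installed; `np.frombuffer` creates a zero-copy view over `foo_array` from the examples above:

```python
import numpy as np

foo_np = np.frombuffer(foo_array, dtype=np.float64)  # no copy: reuses the array's buffer
print(np.sum(foo_np))  # the summation itself takes about 0.5 ms
```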
Conclusion
We saw that using a `list` for storing `float` values, or performing operations on them, is quite inefficient. A list of floats is represented in memory as an array of pointers to `PyFloatObject` structures. So, for a `list` of n floats, the consumed memory is approximately `4*n*8` bytes, i.e., 8 bytes per pointer plus 24 bytes per float object (omitting all additive constants and possible overallocation).
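As a sanity check of that estimate against the per-element size measured in the examples section:

```python
n = 1_000_000
pointers = 8 * n   # the list's pointer array
floats = 24 * n    # one 24-byte PyFloatObject per element
print((pointers + floats) / n)  # 32.0 B per element, close to the measured 32.45
```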
Paying attention to how a list is created can also be beneficial.
Note that the described “inefficiencies” of `list` and `float` are not the only ones. There are many more, some of them stemming from the fact that Python is an interpreted language. Some of them can be solved by third-party libraries.
On the other hand, as Donald Knuth once said (or at least it is attributed to him): premature optimization is the root of all evil. So unless you repeatedly create large lists, you probably should not bother with the above “list creation” optimizations.
However, working with real numbers is quite common in Python, particularly in the context of data analysis or machine learning. In such cases, it makes sense to use more efficient implementations, such as the ones offered by `numpy`.
Sources
- https://github.com/python/cpython/blob/main/Include/cpython/floatobject.h
- https://github.com/python/cpython/blob/main/Include/object.h
- https://github.com/python/cpython/blob/main/Include/cpython/listobject.h
- https://github.com/python/cpython/blob/main/Include/internal/pycore_gc.h
- https://docs.python.org/3/c-api/structures.html
- http://mj.ucw.cz/vyuka/dsnotes/ - Data Structures lecture notes
- http://erikdemaine.org/papers/BRICS2002/ - Erik Demaine’s paper on cache-oblivious algorithms
- https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/vector.tcc - growth factor of std::vector in GCC
- https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/stl_vector.h