Ondřej Měkota | Brief Look at CPython String

Have you ever noticed that a few character string in Python uses several tens of bytes of memory? In our team, which works on matching offers of the same products together, we handle a lot of short strings in Python, and I have been wondering why Python stores strings seemingly so inefficiently.

In this article, we will take a brief look at how Python strings are implemented, and why they consume so much memory and we will see some interesting aspects of Python string.

In the previous blog post (and on heurekadevs.com here: first and second), we taken analysed the implementation of float and list types in Python. We have established, that every Python object has a common ancestor, a common basis, the PyObject struct.

In the rest of the article I assume the reader is to some extent familiar with character representations and encodings (ASCII, UTF-8 – why variable character length is used). Also, for any memory calculations, I assume a 64-bit *nix system.

Python String Structures

In lower-level languages, such as C, a “string” is commonly represented as an array of characters (generally integers, in C, types such as char, wchar_t are used) terminated by null character.

Python has three string representations (but the last one is deprecated - PyUnicodeObject, and it will be removed in Python 3.12, so I’ve omitted it from this article). The first string object, PyASCIIObject, is for ASCII only strings and the other, PyCompactUnicodeObject, is for strings that contain at least one non-ASCII character. PyCompactUnicodeObject actually uses PyASCIIObject as its base and only extends it.

PyObject_HEAD is a macro for creating PyObject, which has 16 bytes, as we have shown. The length is the number of characters in the string, it has 8 bytes. There is also a hash of the string (it does not have to be calculated), 8 bytes.

typedef struct {
    PyObject_HEAD               
    Py_ssize_t length;          /* Number of code points in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    struct {
        unsigned int interned:2;
        unsigned int kind:3;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
        unsigned int :24;
    } state;
    wchar_t *wstr;
} PyASCIIObject;

Then there is the struct with all the bitfields:

interned says whether the string is not interned, interned or interned and immortal
- Interning is a memory saving optimisation detail. It also improves performance of some operations, such as equality comparison – if immutable objects are equal, then their contents are also equal.
- By interning a string (or any object for that matter), Python ensures that only one instance of such object is stored in memory, i.e. if 'hello'was interned, then for any two strings a = 'hello' and b = 'hello' anywhere in the program a is b would be True
- All function, class, and variable names are interned, and so are the keys of dictionaries, for example. Also, any string can be explicitly interned.
- An interned string marked as immortal is never deleted. However, this feature is deprecated since Python 3.10, scheduled for removal in Python 3.12 (and it seems it is not used in practice).
  - A side note for readers interested in the immortality of objects, there is a PEP proposal on making some objects (e.g. None) truly immortal and immutable, including their reference count. This has several interesting consequences, such as the avoidance of cache invalidation after each reference count change.
kind either 1, 2, or 4 byte unicode symbol (ASCII has 1 byte)
compact (legacy bit) for the two described objects it is always set to 1
ascii says whether the string is composed of ASCII only characters
ready (legacy bit) for the two described objects it is always set to 1
And a 24-bit padding to align the structure to 4 bytes.

Finally, there is a pointer wstr which will be referred to later.

The string itself is stored just after this structure, again, this will be explained later.

typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;     /* Number of bytes in utf8, excluding the
                                 * terminating \0. */
    char *utf8;                 /* UTF-8 representation (null-terminated) */
    Py_ssize_t wstr_length;     /* Number of code points in wstr, possible
                                 * surrogates count as two code points. */
} PyCompactUnicodeObject;

String are commonly created through PyUnicode_New C function, if they contain a character with a code point higher than 127, the PyUnicodeCompactObject structure is used, otherwise PyASCIIObject. There are other ways to create a string, and some result in the legacy PyUnicodeObject.

As mentioned before, the actual string is stored directly after this structure, and it’s not referenced from it. Also, the stored string is none of UTF-{8,16,32} encodings but only the fixed-size UCS code points (=numbers, IDs,..) are stored. The reason for that is that fixed size characters are necessary so that, for example, indexing works in constant time – generally it is much easier to work with strings where each character takes up the same amount of space. UTF only encodes these UCS code points to variable-sized characters, thus saving a lot of space (usually) at the cost of some operations.

At the beginning, utf8 is NULL and utf8_length is zero. utf-8 is basically a cache for the UTF-8 representation of the string. This representation is created on-demand, for example when a string is about to be written to a file (in UTF-8), and it is cached for further usage. utf8_length is the number of bytes in UTF-8 encoded version of the string, excluding the null at the end.

The wstr pointer and wstr_length are legacy and they are scheduled to be removed in Python 3.12, soo I won’t describe them here. An interested reader can refer string structs definitions and the PyObject_New function for further understanding.

Immutability

Strings in Python are immutable, as it is described in the documentation. “Immutable” means that once the object is created it cannot be changed. So there can be many references to such an object and all of them can assume the object will never change. Also, the immutability of an object is a necessary (but not a sufficient) condition for being a key in a dictionary.

However, sometimes, it seems that strings are actually mutable. Try the following example.

counter = 0
string1 = ''

for i in range(1_000_000):
     old_id = id(string1)
     string1 += str(i)
     if id(string1) != old_id:
         counter += 1

print(counter)  # returns 47

After most of the concatenations, the new object occupies the same space as the old object and reuses the data of the old object, appending a string at the end. It may happen only if there is only one reference to the object (and there is free memory space following the object). This is optimisation detail of CPython, making concatenation of strings much faster (in some cases) There is a blog post about it. Or see the implementation of unicode_resize, resize_inplace functions in CPython

Memory Consumption

Let us see how much memory strings use. If we create a new empty string, it uses PyASCIIObject and it uses 49 bytes:

import sys

print(sys.getsizeof(''))  # prints 49

Why 49 bytes? Let’s count

PyObject uses 16 bytes
length, hash, wstr* uses 8 bytes each, so 24B total
state uses 4 bytes
the actual string uses 1 byte, as it is null-terminated (\0)

That makes it 45 bytes. However, the structure needs to be aligned to 8 bytes, so it is ‘padded’ using 4 more bytes (the structure then takes 48 bytes).

If we add one ASCII character

print(sys.getsizeof('a'))  # prints 50

it uses 50 bytes.

Now, what if we use non-ASCII characters, like ž. ž has code point 382, so at least 2 bytes are necessary to store this character. And also it will use the PyCompactUnicodeObject:

print(sys.getsizeof('ž'))  # prints 76

These strings are getting quite big 🙂. Let’s count again:

The PyASCIIObject uses 48 bytes
utf_length, utf*, and wstr_length use 8 bytes each, so 24B total
4 bytes for the actual string (two characters, including \0, each uses 2 bytes)

So that is total 76 bytes.

Now let’s have a long ASCII text, 10000 characters:

text = 'a' * 10**4
print(sys.getsizeof(text))  # prints 10049

The size of this is expected (10001 bytes for the string, and 48 bytes for the structure).

Let’s add one non-ASCII character to the text, a big, 4 byte one:

print(sys.getsizeof(text+🖕))  # prints 40080

The size of the string is about 4 times larger. It is because now, each character uses 4 bytes (the size of the largest character). The actual text (10002 characters) uses 40008 bytes, and the PyCompactUnicode structure uses 72 bytes, as we have shown earlier.

Conclusion

In this blog post, we have analysed the structures that are used to store strings in CPython. We have compared them with actual Python code.

We have also shown that Python strings are not completely immutable.

Note that all of the above holds specifically for the reference Python implementation, CPython, and it might (and it likely does) differ in other Python implementations.

Python String Structures

Immutability

Memory Consumption

Conclusion

References