Brief Look at CPython String
Have you ever noticed that a few character string in Python uses several tens of bytes of memory? In our team, which works on matching offers of the same products together, we handle a lot of short strings in Python, and I have been wondering why Python stores strings seemingly so inefficiently.
In this article, we will take a brief look at how Python strings are implemented, and why they consume so much memory and we will see some interesting aspects of Python string.
In the previous blog post (and on heurekadevs.com here: first and second), we taken analysed the implementation of float
and list
types in Python. We have established, that every Python object has a common ancestor, a common basis, the PyObject
struct.
In the rest of the article I assume the reader is to some extent familiar with character representations and encodings (ASCII, UTF-8 – why variable character length is used). Also, for any memory calculations, I assume a 64-bit *nix system.
Python String Structures
In lower-level languages, such as C, a “string” is commonly represented as an array of characters (generally integers, in C, types such as char
, wchar_t
are used) terminated by null character.
Python has three string representations (but the last one is deprecated - PyUnicodeObject
, and it will be removed in Python 3.12, so I’ve omitted it from this article). The first string object, PyASCIIObject
, is for ASCII only strings and the other, PyCompactUnicodeObject
, is for strings that contain at least one non-ASCII character. PyCompactUnicodeObject
actually uses PyASCIIObject
as its base and only extends it.
PyObject_HEAD
is a macro for creating PyObject
, which has 16 bytes, as we have shown. The length
is the number of characters in the string, it has 8 bytes. There is also a hash
of the string (it does not have to be calculated), 8 bytes.
typedef struct {
PyObject_HEAD
Py_ssize_t length; /* Number of code points in the string */
Py_hash_t hash; /* Hash value; -1 if not set */
struct {
unsigned int interned:2;
unsigned int kind:3;
unsigned int compact:1;
unsigned int ascii:1;
unsigned int ready:1;
unsigned int :24;
} state;
wchar_t *wstr;
} PyASCIIObject;
Then there is the struct
with all the bitfields:
interned
says whether the string is not interned, interned or interned and immortal- Interning is a memory saving optimisation detail. It also improves performance of some operations, such as equality comparison – if immutable objects are equal, then their contents are also equal.
- By interning a string (or any object for that matter), Python ensures that only one instance of such object is stored in memory, i.e. if
'hello'
was interned, then for any two stringsa = 'hello'
andb = 'hello'
anywhere in the programa is b
would beTrue
- All function, class, and variable names are interned, and so are the keys of dictionaries, for example. Also, any string can be explicitly interned.
- An interned string marked as immortal is never deleted. However, this feature is deprecated since Python 3.10, scheduled for removal in Python 3.12 (and it seems it is not used in practice).
- A side note for readers interested in the immortality of objects, there is a PEP proposal on making some objects (e.g.
None
) truly immortal and immutable, including their reference count. This has several interesting consequences, such as the avoidance of cache invalidation after each reference count change.
- A side note for readers interested in the immortality of objects, there is a PEP proposal on making some objects (e.g.
kind
either 1, 2, or 4 byte unicode symbol (ASCII has 1 byte)compact
(legacy bit) for the two described objects it is always set to 1ascii
says whether the string is composed of ASCII only charactersready
(legacy bit) for the two described objects it is always set to 1- And a 24-bit padding to align the structure to 4 bytes.
Finally, there is a pointer wstr
which will be referred to later.
The string itself is stored just after this structure, again, this will be explained later.
typedef struct {
PyASCIIObject _base;
Py_ssize_t utf8_length; /* Number of bytes in utf8, excluding the
* terminating \0. */
char *utf8; /* UTF-8 representation (null-terminated) */
Py_ssize_t wstr_length; /* Number of code points in wstr, possible
* surrogates count as two code points. */
} PyCompactUnicodeObject;
String are commonly created through PyUnicode_New
C function, if they contain a character with a code point higher than 127, the PyUnicodeCompactObject
structure is used, otherwise PyASCIIObject
. There are other ways to create a string, and some result in the legacy PyUnicodeObject
.
As mentioned before, the actual string is stored directly after this structure, and it’s not referenced from it. Also, the stored string is none of UTF-{8,16,32} encodings but only the fixed-size UCS code points (=numbers, IDs,..) are stored. The reason for that is that fixed size characters are necessary so that, for example, indexing works in constant time – generally it is much easier to work with strings where each character takes up the same amount of space. UTF only encodes these UCS code points to variable-sized characters, thus saving a lot of space (usually) at the cost of some operations.
At the beginning, utf8
is NULL
and utf8_length
is zero. utf-8
is basically a cache for the UTF-8 representation of the string. This representation is created on-demand, for example when a string is about to be written to a file (in UTF-8), and it is cached for further usage. utf8_length
is the number of bytes in UTF-8 encoded version of the string, excluding the null at the end.
The wstr
pointer and wstr_length
are legacy and they are scheduled to be removed in Python 3.12, soo I won’t describe them here. An interested reader can refer string structs definitions and the PyObject_New
function for further understanding.
Immutability
Strings in Python are immutable, as it is described in the documentation. “Immutable” means that once the object is created it cannot be changed. So there can be many references to such an object and all of them can assume the object will never change. Also, the immutability of an object is a necessary (but not a sufficient) condition for being a key in a dictionary.
However, sometimes, it seems that strings are actually mutable. Try the following example.
counter = 0
string1 = ''
for i in range(1_000_000):
old_id = id(string1)
string1 += str(i)
if id(string1) != old_id:
counter += 1
print(counter) # returns 47
After most of the concatenations, the new object occupies the same space as the old object and reuses the data of the old object, appending a string at the end. It may happen only if there is only one reference to the object (and there is free memory space following the object). This is optimisation detail of CPython, making concatenation of strings much faster (in some cases) There is a blog post about it. Or see the implementation of unicode_resize, resize_inplace functions in CPython
Memory Consumption
Let us see how much memory strings use. If we create a new empty string, it uses PyASCIIObject
and it uses 49 bytes:
import sys
print(sys.getsizeof('')) # prints 49
Why 49 bytes? Let’s count
PyObject
uses 16 byteslength
,hash
,wstr*
uses 8 bytes each, so 24B totalstate
uses 4 bytes- the actual string uses 1 byte, as it is null-terminated (
\0
)
That makes it 45 bytes. However, the structure needs to be aligned to 8 bytes, so it is ‘padded’ using 4 more bytes (the structure then takes 48 bytes).
If we add one ASCII character
print(sys.getsizeof('a')) # prints 50
it uses 50 bytes.
Now, what if we use non-ASCII characters, like ž
. ž
has code point 382
, so at least 2 bytes are necessary to store this character. And also it will use the PyCompactUnicodeObject
:
print(sys.getsizeof('ž')) # prints 76
These strings are getting quite big 🙂. Let’s count again:
- The
PyASCIIObject
uses 48 bytes utf_length
,utf*
, andwstr_length
use 8 bytes each, so 24B total- 4 bytes for the actual string (two characters, including
\0
, each uses 2 bytes)
So that is total 76 bytes.
Now let’s have a long ASCII text, 10000 characters:
text = 'a' * 10**4
print(sys.getsizeof(text)) # prints 10049
The size of this is expected (10001 bytes for the string, and 48 bytes for the structure).
Let’s add one non-ASCII character to the text, a big, 4 byte one:
print(sys.getsizeof(text+🖕)) # prints 40080
The size of the string is about 4 times larger. It is because now, each character uses 4 bytes (the size of the largest character). The actual text (10002 characters) uses 40008 bytes, and the PyCompactUnicode structure uses 72 bytes, as we have shown earlier.
Conclusion
In this blog post, we have analysed the structures that are used to store strings in CPython. We have compared them with actual Python code.
We have also shown that Python strings are not completely immutable.
Note that all of the above holds specifically for the reference Python implementation, CPython, and it might (and it likely does) differ in other Python implementations.