Update, 13 February: I made an amusing and instructive error, detailed below!
Many C and C++ programmers write code as if they believe that `size_t` is a typedef for, or otherwise equivalent to, `signed int`. Such code is unnecessarily buggy and unsafe, because that belief is not true. It never could have been true, and never has been. Yet whole classes of bugs are pervasive because the false belief persists.
Let’s look at the first real C standard, C89. (Well, a copy of the draft, since that’s all we can get for free.) Modern C and C++ standards documents state matters somewhat more completely (and sometimes more clearly), but I want to show that the distinction between `size_t` and `int` goes back 2 generations. In any case it is fair to say that code written after 1989 must show awareness of this crucial distinction.
Section 4.1.5 defines the type `size_t`:

> `size_t` [,] which is the unsigned integral type of the result of the `sizeof` operator;
Section 3.3.3.4, The `sizeof` operator, explains what `sizeof` is all about:

> The `sizeof` operator yields the size (in bytes) of its operand, which may be an expression or the parenthesized name of a type. The size is determined from the type of the operand, which is not itself evaluated. The result is an integer constant. [...] When applied to an operand that has array type, the result is the total number of bytes in the array. When applied to an operand that has structure or union type, the result is the total number of bytes in such an object, including internal and trailing padding.
>
> The value of the result is implementation-defined, and its type (an unsigned integral type) is `size_t` defined in the `<stddef.h>` header. [...] A principal use of the `sizeof` operator is in communication with routines such as storage allocators and I/O systems. A storage-allocation function might accept a size (in bytes) of an object to allocate and return a pointer to void. For example:
>
> ```c
> extern void *alloc();
> double *dp = alloc(sizeof *dp);
> ```
>
> [...] Another use of the `sizeof` operator is to compute the number of members in an array:
>
> ```c
> sizeof array / sizeof array[0]
> ```
In the standard’s example, `alloc` is a hypothetical memory allocation function, but we find that C89’s actual allocation functions, `calloc`, `malloc`, and `realloc`, also use `size_t` for their size arguments①. For example, from section 4.10.3.3:
```c
void *malloc(size_t size);
```

and similarly, `calloc` is defined as:

```c
void *calloc(size_t nmemb, size_t size);
```
Although section 3.3.2.1, Array subscripting, says only vaguely that a subscript of an array “shall have integral type”, we know from the above that that integral type must ultimately be `size_t`. This is for 2 reasons.
First, `size_t` must be an unsigned integral type with the width of a machine word, so that a C program can index any position in the machine’s address space. (See Practical Implications.)
Second, since it’s unsigned, `size_t` may have twice (or more) the positive range of a signed `int`. So if the subscript type were signed `int`, it might be possible to allocate (using `size_t`) an array with more elements than can be indexed (using `int`).
Therefore, we must conclude that `size_t` is the only correct type for all sizes, lengths, object counts, and indices/subscripts. C89 and its standard library use it pervasively for these purposes; not just in allocation functions but in functions like `strlen`, `memset`, `memcpy` — all those old friends.
I have a 64-bit machine with 32 GiB of memory. On this system, `int` is 32 bits and `size_t` is 64. Using C++ to get access to `std::is_signed`, we see:
```
% cat types.cc
#include <stdio.h>

#include <limits>
#include <type_traits>

using namespace std;

int main() {
  constexpr bool is_int_signed = is_signed<int>::value;
  printf("sizeof(int): %zu; signed: %d\n", sizeof(int), is_int_signed);
  printf("largest int value: %d\n", numeric_limits<int>::max());

  constexpr bool is_uint_signed = is_signed<unsigned>::value;
  printf("sizeof(unsigned): %zu; signed: %d\n", sizeof(unsigned), is_uint_signed);
  printf("largest unsigned value: %u\n", numeric_limits<unsigned>::max());

  constexpr bool is_size_t_signed = is_signed<size_t>::value;
  printf("sizeof(size_t): %zu; signed: %d\n", sizeof(size_t), is_size_t_signed);
  printf("largest size_t value: %zu\n", numeric_limits<size_t>::max());
}
% make types && ./types
clang++ -Weverything -Werror -std=c++20 types.cc -o types
sizeof(int): 4; signed: 1
largest int value: 2147483647
sizeof(unsigned): 4; signed: 0
largest unsigned value: 4294967295
sizeof(size_t): 8; signed: 0
largest size_t value: 18446744073709551615
```
As you can see, if `malloc` and friends took `int` arguments, it would be possible to allocate only 2 GiB (2,147,483,647 bytes). Even if these functions took `unsigned int` arguments, it would be possible to allocate only 4 GiB (4,294,967,295 bytes). For my fancy modern machine, these would be unacceptable limitations.
On a 32-bit machine, which can only address 4 GiB, `size_t` is indeed equivalent to `uint32_t`. But on a 64-bit machine, `size_t` must be equivalent to `uint64_t`, and so it is. That is why this works:
```
% cat large.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
  size_t gib = 1 << 30;
  size_t ten_gib = 10 * gib;
  printf("About to allocate %zu bytes\n", ten_gib);
  char* large = malloc(ten_gib);
  memset(large, 'A', ten_gib);
  printf("The last byte is: %c\n", large[ten_gib - 1]);
}
% make large && time ./large
clang -Weverything -Werror -std=c2x large.c -o large
About to allocate 10737418240 bytes
The last byte is: A
./large  2.49s user 3.40s system 90% cpu 6.540 total
```
Even on a 32-bit machine, where `size_t` is `uint32_t`, i.e. there is no difference in width, the difference in signedness still matters. You might think that, on machines of this era, since the kernel used up half of the 4 GiB address space and userland got the other half, it would be OK for `size_t` to be a signed 32-bit type — you can only allocate at most 2 GiB anyway, right?
Well, no. First, it is possible, desirable, and commonplace for the kernel to use 1 GiB of the address space, leaving 3 GiB for userland on 32-bit systems. If you want to write a program to operate on 2.5 GiB of data, `size_t` being unsigned makes that possible.
Second, you might not be writing code for a traditional operating system that puts the kernel and userland in the same address space (albeit with different page protections). You might be writing code for a special system for which even the application runs in kernel mode and needs to access all 4 GiB of the address space. Again, `size_t` being unsigned makes that possible.
Using `int` as the type for sizes and subscripts doesn’t just impose unnecessary limitations in principle; it actually does lead to unnecessary bugs.
To see why and how, first remember that signed integer overflow is undefined behavior (UB) in C and C++. From appendix A.6.2:

> The behavior in the following circumstances is undefined:
>
> [...]
>
> An arithmetic operation is invalid (such as division or modulus by 0) or produces a result that cannot be represented in the space provided (such as overflow or underflow) (3.3).
That is, an expression such as `INT_MAX + n` (where `n` is an `int` greater than 0) has no particular meaning, and the compiler can therefore interpret it to mean anything. Usually this means the compiler will optimize away code that ‘cannot’ happen, or make other assumptions that might not match your own. Therefore, statements and expressions that exercise UB cannot in general be correct. Even if code with UB appears to work, the people affected by that code are just getting lucky. For now. New inputs, or new compilers, might change the program’s behavior.
However, in section 3.1.2.5, C89 provides a carve-out for unsigned arithmetic:

> A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting unsigned integer type.
That is, arithmetic on unsigned types is modular arithmetic: `UINT_MAX + 1` is defined to wrap back around to 0, like the odometer in a car.
This has important implications for correctness in memory allocation and subscripting. For example, this code has several bugs:
```c
int count = ...;

// Implicit cast from `size_t` to `int`; possible (though not in this case)
// truncation.
int size = sizeof(Thing);

// The signed multiplication could overflow and is technically UB. If
// you’re lucky, your implementation might define it to be modular, and
// to wrap. But then you’re allocating a region that cannot hold `count`
// `Thing`s — it will have wrapped around to a too-small value.
//
// In any case, the result of the multiplication is cast to `size_t` for
// the call to `malloc`, which may result in sign extension, which
// might result in even more weirdness.
Thing* things = malloc(count * size);

// At this point, if the allocation succeeded and if `things` points to a
// region of memory large enough to hold `count` `Thing`s, it’s pure luck.
// This code is incorrect, even if it ‘seems to work’.
for (int i = 0; i < count; i++) {
  things[i] = ...;
}
```
Code like this is widespread; I’ve seen it in the wild many times. Especially when the value of `count` and/or the data that get copied into `things` come from an untrustworthy source (like the network), such code is often straightforwardly exploitable②.
We can make code like this less incorrect by doing something like this:
```c
#include <stdckdint.h>

// ...

const size_t count = ...;
const size_t size = sizeof(Thing);
size_t total;
if (ckd_mul(&total, count, size)) {
  return ENOMEM;
}
Thing* things = malloc(total);
```
`ckd_mul` is a standard C23 type-generic macro that returns true if the multiplication overflowed. Yes, C will introduce checked arithmetic in November 2023, 51 entire years after the language was born. Until then, you can (and should) use non-standard compiler intrinsics, or you can try to roll your own check:
```c
#include <stdbool.h>
#include <stddef.h>

// Update: This is wrong! See below.
bool check_mul(size_t* result, size_t x, size_t y) {
  size_t r = x * y;
  if (r < x || r < y) {
    return true;
  }
  *result = r;
  return false;
}

// ...

const size_t count = ...;
const size_t size = sizeof(Thing);
size_t total = 0;
if (check_mul(&total, count, size)) {
  return ENOMEM;
}
```
Note that you must use unsigned types in functions like `check_mul` above (i.e. the ‘check if the result was smaller’ style). If you use signed types, the multiplication may overflow and will thus be undefined, and the compiler will therefore typically assume that overflow cannot happen. Then it will likely ‘optimize’ the code by removing your ‘dead’ `if` block — removing your safety check.
Update: Inevitably, sigh 😅, I got this wrong, which I would have noticed if I had done an exhaustive check of `uint32_t` as I did for another problem, below. (There’s a lesson there!) Jann Horn points out that “you can get a multiplication overflow and still have a result that is bigger than the two operands”, e.g. 0x10000 * 0x11000 = 0x110000000. Thank you, Jann! Let’s stick with `stdckdint.h` or compiler intrinsics.
If you need to check signed arithmetic, you must check limits before doing the potentially undefined operation. For example:
```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

bool checked_mul(int* result, int x, int y) {
  if (x == 0 || y == 0) {
    *result = 0;
    return false;
  }
  if (x < 0 || y < 0) {
    // TODO: You have even more work to do. See e.g.
    // https://github.com/coreutils/gnulib/blob/master/lib/intprops-internal.h#L370
    // for a type-generic macro that handles all cases. `ckd_mul` and the
    // non-standard intrinsics are lookin’ pretty good right about now... 😉
    abort();
  }
  if (INT_MAX / x < y || INT_MAX / y < x) {
    return true;
  }
  *result = x * y;
  return false;
}
```
If that all sounds like a pain in the ass (because it is), you can use `calloc`. In C23 (see section 7.24.3.2 of the C23 draft), and in responsible implementations going back 15+ years, `calloc` is defined to check the `count * size` multiplication and to return `NULL` if the product is too large③.
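A minimal sketch of that route (`Thing` is a stand-in type of my own):

```c
#include <stdlib.h>

typedef struct {
    int id;
    double value;
} Thing;  // hypothetical element type

Thing* make_things(size_t count) {
    // calloc itself checks the count * sizeof(Thing) multiplication and
    // returns NULL if the product would overflow a size_t -- no manual
    // overflow check needed. The memory also comes back zeroed.
    return calloc(count, sizeof(Thing));
}
```

Callers must still check for `NULL`, of course; the point is that an overflowed size can no longer silently produce a too-small allocation.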
Basically, do not use `malloc` directly. Idiomatic usage is typically incorrect and unsafe. Security reviewers like to find some easy pickins by looking at a codebase, running `grep -ri alloc *`, and looking for overflowing arithmetic expressions. They can find a lot of fun stuff. (Try it yourself! You can do much more along these lines with weggli by Felix Wilhelm et al.)
In Extra, Extra - Read All About It: Nearly All Binary Searches and Mergesorts are Broken, by Joshua Bloch in 2006, we learn that even the simple binary search turns out to be buggy due to the use of the wrong subscript type. In 2 languages, no less! Java was defined to use `int` as its array index type, and `int` is defined as a signed 32-bit integer. Unlike C, Java `int` is defined to have modular behavior on overflow, like C’s unsigned types.
The buggy line is in finding a new midpoint for the search:

```java
int mid = (low + high) / 2;
```
In C, this is UB and there are no guarantees whatsoever. (Bloch says that “In C this causes an array index out of bounds with unpredictable results”, but spatial unsafety is the outcome only if the people affected by your program are ‘lucky’ and the arithmetic overflow is implemented to have modular behavior. But there’s no guarantee of that, so the C version of this program has extra bonus UB: arithmetic overflow and buffer overflow.)
In Java, the code is incorrect but safe, because the arithmetic overflow is defined and the invalid array access is defined. This is why I like to say, “Java actually is what people imagine C to be.” What is UB in C is very often well-defined in Java.
That said, Java is still wrong to have used `int` as the array index type, even if only because it is an unnecessary limitation on program data size (as discussed above). But it also leads to this incorrectness problem — in Java, the program will throw an exception when it actually could have successfully completed the binary search. Bloch notes that 1 way to fix it is to use (you guessed it) unsigned arithmetic:
> Probably faster, and arguably as clear is:
>
> ```java
> int mid = (low + high) >>> 1;
> ```
>
> In C and C++ (where you don’t have the `>>>` operator), you can do this:
>
> ```c
> mid = ((unsigned int)low + (unsigned int)high) >> 1;
> ```
Those approaches will work when `low` and `high` are `int`s and thus their sum will always fit in an `unsigned int`.
But if we use the correct index type, `size_t`, we don’t have the extra headroom of `unsigned` — it’s already unsigned — so we need to actually ensure that the overflow does not happen. Bloch’s first solution,

```c
int mid = low + ((high - low) / 2);
```

is the only one that yields the correct midpoint in all cases.
To see why, consider an extreme case, in which we are indexing a giant array of bytes that holds all bytes in the address space. We can see that calculating the true midpoint requires care in modular arithmetic, both for 32- and 64-bit sizes. (For the 64-bit case, pretend for the moment that we can afford that much RAM, and that we have a machine that actually does use all 64 bits for addressing bytes.)
```
% cat midpoint.c
#include <assert.h>
#include <inttypes.h>
#include <limits.h>
#include <stdio.h>

int main() {
  {
    static_assert(sizeof(unsigned) == sizeof(uint32_t),
                  "Assuming `unsigned` is `uint32_t`");
    printf("32-bit:\n");
    const uint32_t low = UINT_MAX - 2;
    const uint32_t high = UINT_MAX;
    uint32_t mid = (low + high) / 2;
    printf("(%" PRIu32 " + %" PRIu32 ") / 2 = %" PRIu32 "\n", low, high, mid);
    mid = low + ((high - low) / 2);
    printf("%" PRIu32 " + ((%" PRIu32 " - %" PRIu32 ") / 2) = %" PRIu32 "\n",
           low, high, low, mid);
  }

  {
    static_assert(sizeof(size_t) == sizeof(uint64_t),
                  "Assuming `size_t` is `uint64_t`");
    printf("64-bit:\n");
    const uint64_t low = SIZE_MAX - 2;
    const uint64_t high = SIZE_MAX;
    uint64_t mid = (low + high) / 2;
    printf("(%" PRIu64 " + %" PRIu64 ") / 2 = %" PRIu64 "\n", low, high, mid);
    mid = low + ((high - low) / 2);
    printf("%" PRIu64 " + ((%" PRIu64 " - %" PRIu64 ") / 2) = %" PRIu64 "\n",
           low, high, low, mid);
  }
}
% make midpoint && ./midpoint
clang -Weverything -Werror -std=c2x midpoint.c -o midpoint
32-bit:
(4294967293 + 4294967295) / 2 = 2147483646
4294967293 + ((4294967295 - 4294967293) / 2) = 4294967294
64-bit:
(18446744073709551613 + 18446744073709551615) / 2 = 9223372036854775806
18446744073709551613 + ((18446744073709551615 - 18446744073709551613) / 2) = 18446744073709551614
```
Obviously, on a 64-bit machine, we are highly unlikely to find ourselves exercising this edge case. ☺️ (Although don’t get too complacent.) But as you can see, with 32-bit `size_t`, we are well within range of trouble. (See demo below.)
In any case, I would rather have such foundational functions as binary search be correct because they are correct, rather than ‘good enough’ contingent on the limitations of the platform the program runs on.
That said, on a 64-bit machine, we can emulate and exhaustively test what would happen on a 32-bit machine if we were to use `uint32_t` to emulate `size_t` as the index type. My example program, exhaustive.c, creates an array with 4 billion elements, and then attempts to use binary search to find all of them. It shows that if you use an unsigned index type, you have to use (as midpoint.c suggests) the `low + ((high - low) / 2)` method to get a correct program. Using the correct arithmetic takes about 6 minutes on my laptop:
```
% date ; time ./exhaustive c ; date
Sat Feb 11 22:35:47 PST 2023
Created sorted array of `uint32_t` values.
./exhaustive c  347.17s user 8.99s system 95% cpu 6:13.90 total
Sat Feb 11 22:42:01 PST 2023
```
The incorrect arithmetic gets stuck in an infinite loop, unable to find a valid midpoint (as midpoint.c suggested will happen). In this example, I gave up after 18 minutes:
```
% date ; time ./exhaustive i ; date
Sat Feb 11 22:42:29 PST 2023
Created sorted array of `uint32_t` values.
^C
./exhaustive i  1031.42s user 18.88s system 95% cpu 18:22.13 total
```
Types matter. That ‘everything is an `int` in C’ is no more true than that ‘everything is a list in Lisp’ — and we should be glad of that! The nearly typeless B language, and the untyped lambda calculus, are insufficient as programming tools. Application-domain types clearly matter, but primitive types matter too — perhaps especially, since higher-level types and functions are built on the primitives.
Proofs of correctness of binary search are valid only on the assumption that languages can easily express correct integer arithmetic. (See this one, for example.) In fact, of course, our programming languages make expressing even simple arithmetic incredibly difficult and unwieldy. A complete proof must take the failures of real-world language design into account. This gap between theory and practice, between computing science and software engineering, is wider than people sometimes realize④.
As for the not-fully-correct solution being “probably faster”, that is unlikely to be significant. The practical performance difference between `mid = ((unsigned int)low + (unsigned int)high) >> 1` and `int mid = low + ((high - low) / 2)` is 1 additional subtraction operation on registers, i.e. 1 machine cycle. (The `/ 2` gets optimized to `>> 1` by a modern optimizing compiler.) In a loop involving non-local accesses to main memory — you’re jumping around in the array, not processing it in a linear, cache-friendly way — that fraction of a nanosecond is not going to make or break the performance-fitness of your program.
Assuming we’re already using the best available data structures and algorithms, the most significant way to make programs faster is by increasing parallelism. In general, correctness and safety are crucial to achieving parallelism. Observing the distinctions between types is a particularly effective way to improve program correctness and safety.
In any case, we’ve been living with unnecessarily buggy C/C++ programs since 1989, even accounting for all the other kinds of bugs inherent and special to C/C++. 33 years of entirely preventable bugs and exploitable vulnerabilities, all because it’s too hard to express the 1 thing computers actually do: arithmetic.
A good engineer never blames their tools. But a good engineer is always searching for the best available tools.
1. K&R, 2nd edition, page 187 shows a version of `malloc` taking `unsigned`, while p. 167 has it correct as `size_t`. It seems likely that K&R just forgot to update the example on p. 187 when revising the book for the 2nd edition, which was updated to describe C89.
2. When interviewing security-focused engineers, I often ask them to spot the bugs in code like this, and to explain how it could be exploited.
3. If application-specific macro-benchmarks and testing show that `calloc`’s zeroing memory makes the program too slow for its purpose, you can define your own allocation function that checks the multiplication and then passes the result to `malloc`. Of course, this point only applies if your application has performance fitness tests. Unless you have such tests, `calloc` is not too slow.
4. It reminds me a bit of how with big-O notation, we ignore constant factors because in principle they don’t matter — but in practice, for actual inputs to algorithms implemented on actual computers, the constant factors can be decisive.