size_t Is Not int
12 February 2023 · noncombatant.org

Update, 13 February: I made an amusing and instructive error, detailed below!

Many C and C++ programmers write code as if they believe that size_t is a typedef for, or otherwise equivalent to, signed int. Such code is unnecessarily buggy and unsafe, because that belief is not true. It never could have been true, and never has been. Yet classes of bugs are pervasive because the false belief persists.

Definitions

Let’s look at the first real C standard, C89. (Well, a copy of the draft, since that’s all we can get for free.) Modern C and C++ standards documents state matters somewhat more completely (and sometimes more clearly), but I want to show that the distinction between size_t and int goes back 2 generations. In any case it is fair to say that code written after 1989 must show awareness of this crucial distinction.

Section 4.1.5 defines the type size_t:

size_t[,] which is the unsigned integral type of the result of the sizeof operator;

Section 3.3.3.4, The sizeof operator, explains what sizeof is all about:

The sizeof operator yields the size (in bytes) of its operand, which may be an expression or the parenthesized name of a type. The size is determined from the type of the operand, which is not itself evaluated. The result is an integer constant.

[...] When applied to an operand that has array type, the result is the total number of bytes in the array. When applied to an operand that has structure or union type, the result is the total number of bytes in such an object, including internal and trailing padding.

The value of the result is implementation-defined, and its type (an unsigned integral type) is size_t defined in the <stddef.h> header.

[...] A principal use of the sizeof operator is in communication with routines such as storage allocators and I/O systems. A storage-allocation function might accept a size (in bytes) of an object to allocate and return a pointer to void. For example:

extern void *alloc();
double *dp = alloc(sizeof *dp);

[...] Another use of the sizeof operator is to compute the number of members in an array:

sizeof array / sizeof array[0]

In the example, alloc is a hypothetical memory allocation function, but C89’s actual allocation functions, calloc, malloc, and realloc, likewise use size_t for their size arguments. For example, from section 4.10.3.3:

void *malloc(size_t size);

and similarly, calloc is defined as:

void *calloc(size_t nmemb, size_t size);

Although section 3.3.2.1, Array subscripting, says only vaguely that a subscript of an array “shall have integral type”, we know from the above that that integral type must ultimately be size_t. This is for 2 reasons.

First, size_t must be an unsigned integral type with the width of a machine word, so that it is possible for a C program to index any position in the machine’s address space. (See Practical Implications.)

Second, since it’s unsigned, size_t may have twice (or more) the positive range of a signed int. So if the subscript type were signed int, it might be possible to allocate (using size_t) an array with more elements than can be indexed (using int).

Therefore, we must conclude that size_t is the only correct type for all sizes, lengths, object counts, and indices/subscripts. C89 and its standard library use it pervasively for these purposes; not just in allocation functions but in functions like strlen, memset, memcpy — all those old friends.

Practical Implications

I have a 64-bit machine with 32 GiB of memory. On this system, int is 32 bits and size_t is 64. Using C++ to get access to std::is_signed, we see:

% cat types.cc
#include <stdio.h>
#include <limits>
#include <type_traits>

using namespace std;

int main() {
  constexpr bool is_int_signed = is_signed<int>::value;
  printf("sizeof(int): %zu; signed: %d\n", sizeof(int), is_int_signed);
  printf("largest int value: %d\n", numeric_limits<int>::max());

  constexpr bool is_uint_signed = is_signed<unsigned>::value;
  printf("sizeof(unsigned): %zu; signed: %d\n", sizeof(unsigned), is_uint_signed);
  printf("largest unsigned value: %u\n", numeric_limits<unsigned>::max());

  constexpr bool is_size_t_signed = is_signed<size_t>::value;
  printf("sizeof(size_t): %zu; signed: %d\n", sizeof(size_t), is_size_t_signed);
  printf("largest size_t value: %zu\n", numeric_limits<size_t>::max());
}
% make types && ./types
clang++ -Weverything -Werror -std=c++20 types.cc -o types
sizeof(int): 4; signed: 1
largest int value: 2147483647
sizeof(unsigned): 4; signed: 0
largest unsigned value: 4294967295
sizeof(size_t): 8; signed: 0
largest size_t value: 18446744073709551615

As you can see, if malloc and friends took int arguments, it would be possible to allocate only 2 GiB (2,147,483,647 bytes). Even if these functions took unsigned int arguments, it would be possible to allocate only 4 GiB (4,294,967,295 bytes). For my fancy modern machine, these would be unacceptable limitations.

On a 32-bit machine, which can only address 4 GiB, size_t is indeed equivalent to uint32_t. But on a 64-bit machine, size_t must be equivalent to uint64_t, and so it is. That is why this works:

% cat large.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
  size_t gib = 1 << 30;
  size_t ten_gib = 10 * gib;
  printf("About to allocate %zu bytes\n", ten_gib);
  char* large = malloc(ten_gib);
  if (large == NULL) {
    perror("malloc");
    return 1;
  }
  memset(large, 'A', ten_gib);
  printf("The last byte is: %c\n", large[ten_gib - 1]);
}
% make large && time ./large
clang -Weverything -Werror -std=c2x large.c -o large
About to allocate 10737418240 bytes
The last byte is: A
./large  2.49s user 3.40s system 90% cpu 6.540 total

Even on a 32-bit machine, where size_t is uint32_t, i.e. there is no difference in width, the difference in signedness still matters. You might think that, on machines of this era, since the kernel used up half of the 4 GiB address space and userland got the other half, it would be OK for size_t to be a signed 32-bit type — you can only allocate at most 2 GiB anyway, right?

Well, no. First, it is possible, desirable, and commonplace for the kernel to use 1 GiB of the address space, leaving 3 GiB for userland on 32-bit systems. If you want to write a program to operate on 2.5 GiB of data, size_t being unsigned makes that possible.

Second, you might not be writing code for a traditional operating system that puts the kernel and userland in the same address space (albeit with different page protections). You might be writing code for a special system for which even the application runs in kernel mode and needs to access all 4 GiB of the address space. Again, size_t being unsigned makes that possible.

Bugs

Using int as the type for sizes and subscripts does not just impose unnecessary limitations in principle; it leads to unnecessary bugs in practice.

To see why and how, first remember that signed integer overflow is undefined behavior (UB) in C and C++. From appendix A.6.2:

The behavior in the following circumstances is undefined:

[...]

  • An arithmetic operation is invalid (such as division or modulus by 0) or produces a result that cannot be represented in the space provided (such as overflow or underflow) (3.3).
That is, an expression such as INT_MAX + n (where n is an int greater than 0) has no particular meaning, and the compiler can therefore interpret it to mean anything. Usually this means the compiler will optimize away code that ‘cannot’ happen, or make other assumptions that might not match your own. Therefore, statements and expressions that exercise UB cannot in general be correct. Even if code with UB appears to work, the people affected by that code are just getting lucky. For now. New inputs, or new compilers, might change the program’s behavior.

However, in section 3.1.2.5, C89 provides a carve-out for unsigned arithmetic:

A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting unsigned integer type.

That is, arithmetic on unsigned types is modular arithmetic: UINT_MAX + 1 is defined to wrap back around to 0, like the odometer in a car.

This has important implications for correctness in memory allocation and subscripting. For example, this code has several bugs:

  int count = ...;

  // Implicit cast from `size_t` to `int`; possible (though not in this case)
  // truncation.
  int size = sizeof(Thing);

  // The signed multiplication could overflow and is technically UB. If
  // you’re lucky, your implementation might define it to be modular, and
  // to wrap. But then you’re allocating a region that cannot hold `count`
  // `Thing`s — it will have wrapped around to a too-small value.
  //
  // In any case, the result of the multiplication is cast to `size_t` for
  // the call to `malloc`, which may result in sign extension, which
  // might result in even more weirdness.
  Thing* things = malloc(count * size);

  // At this point, if the allocation succeeded and if `things` points to a
  // region of memory large enough to hold `count` `Thing`s, it’s pure luck.
  // This code is incorrect, even if it ‘seems to work’.

  for (int i = 0; i < count; i++) {
    things[i] = ...;
  }

Code like this is widespread; I’ve seen it in the wild many times. Especially when the value of count and/or the data that get copied into things come from an untrustworthy source (like the network), such code is often straightforwardly exploitable.

We can make code like this less incorrect by doing something like this:

#include <stdckdint.h>

// ...

  const size_t count = ...;
  const size_t size = sizeof(Thing);
  size_t total;
  if (ckd_mul(&total, count, size)) {
    return ENOMEM;
  }
  Thing* things = malloc(total);

ckd_mul is a standard C23 function that returns true if the multiplication overflowed. Yes, C will introduce checked arithmetic in November 2023, 51 entire years after the language was born. Until then, you can (and should) use non-standard compiler intrinsics, or you can try to roll your own check:

#include <stdbool.h>
#include <stddef.h>

// Update: This is wrong! See below.
bool check_mul(size_t* result, size_t x, size_t y) {
  size_t r = x * y;
  if (r < x || r < y) {
    return true;
  }
  *result = r;
  return false;
}

// ...

  const size_t count = ...;
  const size_t size = sizeof(Thing);
  size_t total = 0;
  if (check_mul(&total, count, size)) {
    return ENOMEM;
  }

Note that you must use unsigned types in functions like check_mul above! (I.e. the ‘check if result was smaller’ style.) If you use signed types, the multiplication may overflow and will thus be undefined, and the compiler will therefore typically assume that overflow cannot happen. Then it will likely ‘optimize’ the code by removing your ‘dead’ if block — removing your safety check.

Update: Inevitably, sigh 😅, I got this wrong, which I would have noticed if I had done an exhaustive check of uint32_t as I did for another problem, below. (There’s a lesson there!) Jann Horn points out that “you can get a multiplication overflow and still have a result that is bigger than the two operands”, e.g. 0x10000 * 0x11000 = 0x110000000. Thank you, Jann! Let’s stick with stdckdint.h or compiler intrinsics.

If you need to check signed arithmetic, you must check limits before doing the potentially undefined operation. For example:

#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

bool checked_mul(int* result, int x, int y) {
  if (x == 0 || y == 0) {
    *result = 0;
    return false;
  }
  if (x < 0 || y < 0) {
    // TODO: You have even more work to do. See e.g.
    // https://github.com/coreutils/gnulib/blob/master/lib/intprops-internal.h#L370
    // for a type-generic macro that handles all cases. `ckd_mul` and the
    // non-standard intrinsics are lookin’ pretty good right about now... 😉
    abort();
  }
  if (INT_MAX / x < y || INT_MAX / y < x) {
    return true;
  }
  *result = x * y;
  return false;
}
    

If that all sounds like a pain in the ass (because it is), you can use calloc. In C23 (see section 7.24.3.2 of the C23 draft), and in responsible implementations going back 15+ years, calloc is defined to check the count * size multiplication and to return NULL if the product is too large.

Basically, do not use malloc directly. Idiomatic usage is typically incorrect and unsafe. Security reviewers like to find some easy pickins by looking at a codebase, running grep -ri alloc *, and looking for overflowing arithmetic expressions. They can find a lot of fun stuff. (Try it yourself! You can do much more along these lines with weggli by Felix Wilhelm et al.)

These Bugs Are Old And Subtle

In Extra, Extra - Read All About It: Nearly All Binary Searches and Mergesorts are Broken, by Joshua Bloch in 2006, we learn that even the simple binary search turns out to be buggy due to the use of the wrong subscript type. In 2 languages, no less! Java was defined to use int as its array index type, and int is defined as a signed 32-bit integer. Unlike C, Java int is defined to have modular behavior on overflow, like C’s unsigned types.

The buggy line is in finding a new midpoint for the search:

int mid = (low + high) / 2;

In C, this is UB and there are no guarantees whatsoever. (Bloch says that “In C this causes an array index out of bounds with unpredictable results”, but spatial unsafety is the outcome only if the people affected by your program are ‘lucky’ and the arithmetic overflow is implemented to have modular behavior. But there’s no guarantee of that, so the C version of this program has extra bonus UB: arithmetic overflow and buffer overflow.)

In Java, the code is incorrect but safe, because the arithmetic overflow is defined and the invalid array access is defined. This is why I like to say, “Java actually is what people imagine C to be.” What is UB in C is very often well-defined in Java.

That said, Java is still wrong to have used int as the array index type, even if only because it is an unnecessary limitation on program data size (as discussed above). But it also leads to this incorrectness problem — in Java, the program will throw when it actually could have successfully completed the binary search. Bloch notes that 1 way to fix it is to use (you guessed it) unsigned arithmetic:

Probably faster, and arguably as clear is:

int mid = (low + high) >>> 1;

In C and C++ (where you don’t have the >>> operator), you can do this:

mid = ((unsigned int)low + (unsigned int)high) >> 1;

Those approaches will work when low and high are ints and thus their sum will always fit in unsigned int.

But if we used the correct index type, size_t, we don’t have the extra headroom of unsigned — it’s already unsigned — so we need to actually ensure that the overflow does not happen. Bloch’s first solution,

int mid = low + ((high - low) / 2);

is the only one that yields the correct midpoint in all cases.

To see why, consider an extreme case, in which we are indexing a giant array of bytes that holds all bytes in the address space. We can see that calculating the true midpoint requires care in modular arithmetic, both for 32- and 64-bit sizes. (For the 64-bit case, pretend for the moment that we can afford that much RAM, and that we have a machine that actually does use all 64 bits for addressing bytes.)

% cat midpoint.c
#include <assert.h>
#include <inttypes.h>
#include <limits.h>
#include <stdio.h>

int main() {
  {
    static_assert(sizeof(unsigned) == sizeof(uint32_t),
                  "Assuming `unsigned` is `uint32_t`");
    printf("32-bit:\n");
    const uint32_t low = UINT_MAX - 2;
    const uint32_t high = UINT_MAX;
    uint32_t mid = (low + high) / 2;
    printf("(%" PRIu32 " + %" PRIu32 ") / 2 = %" PRIu32 "\n", low, high, mid);

    mid = low + ((high - low) / 2);
    printf("%" PRIu32 " + ((%" PRIu32 " - %" PRIu32 ") / 2) = %" PRIu32 "\n",
           low, high, low, mid);
  }

  {
    static_assert(sizeof(size_t) == sizeof(uint64_t),
                  "Assuming `size_t` is `uint64_t`");
    printf("64-bit:\n");
    const uint64_t low = SIZE_MAX - 2;
    const uint64_t high = SIZE_MAX;
    uint64_t mid = (low + high) / 2;
    printf("(%" PRIu64 " + %" PRIu64 ") / 2 = %" PRIu64 "\n", low, high, mid);

    mid = low + ((high - low) / 2);
    printf("%" PRIu64 " + ((%" PRIu64 " - %" PRIu64 ") / 2) = %" PRIu64 "\n",
           low, high, low, mid);
  }
}
% make midpoint && ./midpoint
clang -Weverything -Werror -std=c2x midpoint.c -o midpoint
32-bit:
(4294967293 + 4294967295) / 2 = 2147483646
4294967293 + ((4294967295 - 4294967293) / 2) = 4294967294
64-bit:
(18446744073709551613 + 18446744073709551615) / 2 = 9223372036854775806
18446744073709551613 + ((18446744073709551615 - 18446744073709551613) / 2) = 18446744073709551614

Obviously, on a 64-bit machine, we are highly unlikely to find ourselves exercising this edge case. ☺️ (Although don’t get too complacent.) But as you can see, with 32-bit size_t, we are well within range of trouble. (See demo below.)

In any case, I would rather have such foundational functions as binary search be correct because they are correct, rather than ‘good enough’ contingent on the limitations of the platform the program runs on.

That said, on a 64-bit machine, we can emulate and exhaustively test what would happen in a 32-bit machine if we were to use uint32_t to emulate size_t as the index type. My example program, exhaustive.c, creates an array with 4 billion elements, and then attempts to use binary search to find all of them. It shows that if you use an unsigned index type, you have to use (as midpoint.c suggests) the low + ((high - low) / 2) method to get a correct program. Using the correct arithmetic takes about 6 minutes on my laptop:

% date ; time ./exhaustive c ; date
Sat Feb 11 22:35:47 PST 2023
Created sorted array of `uint32_t` values.
./exhaustive c  347.17s user 8.99s system 95% cpu 6:13.90 total
Sat Feb 11 22:42:01 PST 2023

The incorrect arithmetic gets stuck in an infinite loop, unable to find a valid midpoint (as midpoint.c suggested will happen). In this example, I gave up after 18 minutes:

% date ; time ./exhaustive i ; date
Sat Feb 11 22:42:29 PST 2023
Created sorted array of `uint32_t` values.
^C
./exhaustive i  1031.42s user 18.88s system 95% cpu 18:22.13 total

Conclusion

Types matter. That ‘everything is an int in C’ is no more true than that ‘everything is a list in Lisp’ — and we should be glad of that! The nearly typeless B language, and the untyped lambda calculus, are insufficient as programming tools. Application-domain types clearly matter, but primitive types matter too — perhaps especially, since higher-level types and functions are built on the primitives.

Proofs of correctness of binary search are valid only on the assumption that languages can easily express correct integer arithmetic. (See this one, for example.) In fact, of course, our programming languages make expressing even simple arithmetic incredibly difficult and unwieldy. A complete proof must take the failures of real-world language design into account. This gap between theory and practice, between computing science and software engineering, is wider than people sometimes realize.

As for the not-fully-correct solution being “probably faster”, that is unlikely to be significant. The practical performance difference between mid = ((unsigned int)low + (unsigned int)high) >> 1 and int mid = low + ((high - low) / 2) is 1 additional subtraction operation on registers, i.e. 1 machine cycle. (The / 2 gets optimized to >> 1 on a modern optimizing compiler.) In a loop involving non-local accesses to main memory — you’re jumping around in the array, not processing it in a linear, cache-friendly way — that fraction of a nanosecond is not going to make or break the performance-fitness of your program.

Assuming we’re already using the best available data structures and algorithms, the most significant way to make programs faster is by increasing parallelism. In general, correctness and safety are crucial to achieving parallelism. Observing the distinctions between types is a particularly effective way to improve program correctness and safety.

In any case, we’ve been living with unnecessarily buggy C/C++ programs since 1989, even accounting for all the other kinds of bugs inherent and special to C/C++. 33 years of entirely preventable bugs and exploitable vulnerabilities, all because it’s too hard to express the 1 thing computers actually do: arithmetic.

A good engineer never blames their tools. But a good engineer is always searching for the best available tools.


1. K&R, 2nd edition, page 187 shows a version of malloc taking unsigned, while p. 167 has it correct as size_t. It seems likely that K&R just forgot to update the example on p. 187 when updating the book for the 2nd edition, which was updated to describe C89.

2. When interviewing security-focused engineers, I often ask them to spot the bugs in code like this, and to explain how it could be exploited.

3. If application-specific macro-benchmarks and testing show that calloc’s zeroing memory makes the program too slow for its purpose, you can define your own allocation function that checks the multiplication and then passes the result to malloc. Of course, this point only applies if your application has performance fitness tests. Unless you have such tests, calloc is not too slow.

4. It reminds me a bit of how with big-O notation, we ignore constant factors because in principle they don’t matter — but in practice, for actual inputs to algorithms implemented on actual computers, the constant factors can be decisive.


Source: https://noncombatant.org/2023/02/12/int-size-t