Data Structures: The Bit Array

A compact representation of an array of booleans

A common task that you will need to accomplish in programming is to represent a large array of states. For example, you may want to tag each element in a list with a “status” bit to indicate some information about that record. Perhaps whether it is deleted or not, or whether it has been loaded from disk or not, etc. The obvious approach to accomplish this would be to attach a boolean to each element within the array.

For example, consider the following Python implementation,

class array_element:
    def __init__(self, value):
        self.value = value
        self.deleted = false

    def delete(self):
        self.deleted = true


x = [array_element(1), array_element(2), array_element(3), ...]

In this code, we could delete an element by calling its delete method, and thus setting its deleted attribute to true. Code processing the array, then, can recognize it as deleted and handle it appropriately.

This approach has many virtues, but it does couple the data (value) with the metadata (deleted) into a single object. We could instead separate the data from the metadata by maintaining two different arrays,

x = [1, 2, 3, ...]
x_deleted = [False, False, False, ...]

which greatly simplifies the code. Now we can mark an element in x as deleted by assigning True to its corresponding index within x_deleted.

# delete element i
x_deleted[i] = True

# check if element i is deleted
if x_deleted[i]:
    # x[i] has been deleted    

Bundle these two arrays together into an object to abstract away the details of manipulating them, and we’re good to go!

Except, you may notice that there is a bit of inefficiency in this process. A boolean value carries only 1 bit of information (true/false, 0/1, on/off, high/low, etc.), but in many programming languages requires a lot more than 1 bit of memory to represent. This is necessary when you’re working with individual booleans, because computer memory isn’t actually bit-addressable. But, if we’re packing a whole bunch of booleans together, we may be able to compact them into a more space-efficient representation, where each element requires only a single bit of storage. This is the basic idea of a bit array.

Bitwise Data Manipulation

Before we can dive into how to construct a bit array, we’ll first need to address a few low-level operators that most programming languages implement for manipulating binary data directly. These are often called bitwise operators, and there are four in particular that we will use in our bit array implementation. These are,

  1. bitwise AND
  2. bitwise OR
  3. bitwise left shift
  4. bitwise complement

Bitwise AND (often the & operator) is a binary operation which accepts two numbers. It compares them, bit-by-bit, and creates a new number where each bit is the result of an AND operation between the corresponding two bits in the input (that is, the i’th bit in the output will be 1 if and only if the i’th bit in both inputs is 1). As an example, consider 0b00111100 & 0b01010100. I’ll line them up so you can easily see the bit-correspondences,

  0b00111100  (first operand)
& 0b01010100  (second operand)
  0b00010100  (result)

Bitwise OR (often the | operator) works the same way, except it takes the OR of each pair of bits. The i’th output bit will be 1 if and only if either (or both) of the i’th bits in the operands are 1. We’ll do the bitwise OR of the same two numbers: 0b00111100 | 0b01010100

  0b00111100  (first operand)
| 0b01010100  (second operand)
  0b01111100  (result)

The way I like to remember these is as follows,

A bit in the output of a bitwise AND will be 1 if the corresponding bit in both the inputs is 1, and 0 otherwise. A bit in the ouput of a bitwise OR will be 0 if the corresponding bit in both of the inputs is 0, and 1 otherwise.

The bitwise left shift (often the << operator) is a binary operator that takes two numbers. The first is a sequence of bits to be shifted, and the second is the number of shifts to perform. The basic idea of the left shift is that, for each shift, a 0 is added on the right-hand side of the number (the least significant bit), and the most significant bit is discarded, to ensure that the total number of bits stays the same. For example, 0b10010011 << 1,

<< 1  0b10010011    (input)
      0b00100110    (output [one shift])

Finally, the bitwise complement (often the ~ operator) is a unary operator that takes a single number and inverts it. All of its 0s become 1s, and vice versa. For example, ~0b00110011:

~ 0b00110011    (input)
  0b11001100    (output)

Applications the Bitwise Operators

If it isn’t immediately apparent how these operators will be useful to us, let’s map them directly onto useful functions that we will need in order to manage our bit array. We will need to support the following operations at a bit level,

  1. Test whether a bit is set
  2. Set a bit (make it a 1)
  3. Unset a bit (make it a 0)

For the sake of convenience in our implementation, let’s also require that the set and unset operations are idempotent. This means that repeatedly applying them will not alter the result compared to only applying them once. If a bit is already set, then applying the set operation should not change anything. Likewise, if a bit is already unset, then applying the unset operation shouldn’t change anything.

Let’s start by considering a bit array of only a single byte (represented by, say, the char datatype in C or C++). Later we’ll see how we can construct an arbitrarily large bit array.

Testing a Bit

We can use the bitwise AND operator to test if a bit is set. Construct a byte which contains a 1 in only the bit of interest, and AND it with the byte we want to test. We’ll have two cases,

  1. The bit is set
  2. The bit is not set

In the first case, we’ll have the following,

  0b11001010
& 0b00001000
  0b00001000

See the non-zero result of the operation? Were the bit we were testing to not be set,

  0b11000010
& 0b00001000
  0b00000000

the result will be 0. So this suggests a pretty simple test for the status of a bit within a byte using bitwise AND,

if ( byte & test_mask ) then
    --the bit is set
else
    --the bit is not set
end

where byte is the bit array, and test_mask is a byte that has been constructed to have a single 1 bit, at the index we want to test.

Setting a Bit

Likewise, we can use a bitwise OR to set an arbitrary bit,

  0b11000010
| 0b00001000
  0b11001010

which suggests the following solution to set a bit within our byte,

byte = byte | set_mask

where byte is the bit array and set_mask is a byte that has been constructed to have a single 1 bit, at the index we want to set

Note that this set operation is idempotent. If we repeatedly execute the above code with the same set_mask, the end result will remain the same as if we had executed it one time.

Unsetting a Bit

While setting a bit is pretty straightforward, unsetting is a bit tricker. What we want is an operation that will create a 0 in the output if and only if both of the input bits are 1. We don’t have an operation in a toolkit that will do this directly.1

The trick here is to use the bitwise complement on the mask first. To demonstrate, say we want to unset the 5th bit of our bit array, then our mask, and its complement, are,

0b00010000
0b11101111

If we then AND this complement with the bit array, the result will be a byte which has all of the same bits set as the original, with the exception of the bit we are trying to unset. So our unset procedure is then,

byte = byte & (~unset_mask)

Constructing the Mask

We now have code for performing all of the main operations we need, however these operations are all defined in terms of a mask–a special byte that has a single 1 bit in the correct spot. So we are going to need some way to create this mask, as well.

As it turns out, this is simple enough to do using a bitwise left shift. The number 1, in binary, is 0b00000001. So if we want to test the fifth bit, we can simply take our 1 and shift it four times,

0b00000001  (1)
0b00010000  (1 after << 4)

So the creation of the mask is pretty straightforward,

mask = 1 << index

will give us the mask corresponding to a given bit, by index. Note that this works if we number the bits from 0 to 7. If you’d rather use 1-8, then you need to subtract 1 from index first, before shifting.

Single-byte Bit Array Implementation

With these operations in mind, we can now create a simple implementation of a single byte bit array. Note that, as this array contains only a single byte, it can only store 8 values. In the next section we’ll discuss expanding this to support arbitrary lengths. The implementation here is given in C, in a procedural style. Note that I haven’t included any error handling, so this code isn’t “production-ready”, it’s just a simple example.

typedef struct bitarray {
    char data;
} bitarray

void create_bitarray()
{
    bitarray *array = malloc(sizeof(bitarray));
    array->data = 0; // initialize to 0 explicitly (all bits unset)

    return array;
}

void set_bit(bitarray *array, size_t index)
{
    char mask = 1 << index;
    array->data = array->data | mask;
}

void unset_bit(bitarray *array, size_t index)
{
    char mask = 1 << index;
    array->data = array->data & (~mask);
}

bool test_bit(bitarray *array, size_t index)
{
    char mask = 1 << index;
    return array->data & mask;
}

void delete_bitarray(bitarray *array)
{
    free(array);
}

Multi-byte Bit Arrays

Okay, so we now have a working single-byte bit array. Can we extend this to support an arbitrary number of bytes? Sure! It isn’t even all that hard.

We’ll still use a byte datatype to store our bits, but rather than just a single byte, we’ll use an array of them. This means that, when we go to access a given bit, we need to figure out which byte contains that bit, and then which bit within the byte we want. So we will need to convert our single index (I want bit 20), into two different indices (byte index and bit index).

The trick is to note that each byte contains 8 bits. So we can get the byte number by simply dividing (integer division, mind, Python3 users beware) the input index by 8. In the case of 20 mentioned earlier then, we have $20/8 = 2$ as our byte index. To get the bit within that byte, we then need to calculate the remainder of that same division, $\mod(20, 8) = 4$. So the bit at index 20 is located in the bit at index 4 within the byte at index 2. In code terms, this gives us,

byte_index = index / 8
bit_index = index % 8

The byte index tells us which byte to use, and then the bit index is what we use to calculate the mask.

Note then that this also means that when we go to allocate the array of bytes to represent our bit array, we will end up wasting a little space. This is because we need to allocate space in chunks of 8 bits. So, if we wanted an array that would store 21 bits, we will actually need to allocate 24 bits (3 bytes), $\lceil \text{size} / 8 \rceil$, to ensure we have enough room. This is a wastage of 3 bits, but that is nothing compared to the wastage that would have accompanied creating an array of 21 booleans (each of which could be as large as 4 bytes apiece!)

In any case, accounting for all of this allows us to construct a multi-byte bit array. Again, the following code is in C and lacks necessary error handling for simplicity,

typedef struct bitarray {
    char data*;
    size_t logical_size;
    size_t physical_size;
} bitarray

bitarray *create_bitarray(size_t size)
{
    bitarray *array = malloc(sizeof(bitarray));
    bitarray->logical_size = size;
    bitarray->physical_size = ceil((double) size / 8.0);
    array->data = calloc(array->physical_size);

    return array;
}

void set_bit(bitarray *array, size_t index)
{
    size_t byte_index = index / 8;
    size_t bit_index = index % 8;
    char mask = 1 << bit_index;

    array->data[byte_index] = array->data[byte_index] | mask;
}

void unset_bit(bitarray *array, size_t index)
{
    size_t byte_index = index / 8;
    size_t bit_index = index % 8;
    char mask = 1 << bit_index;

    array->data[byte_index] = array->data[byte_index] & (~mask);
}

bool test_bit(bitarray *array, size_t index)
{
    size_t byte_index = index / 8;
    size_t bit_index = index % 8;
    char mask = 1 << bit_index;

    return array->data[byte_index] & mask;
}

void delete_bitarray(bitarray *array)
{
    free(array->data);
    free(array);
}

Conclusion

And there you have it, all the basic building blocks needed to put together your own bit array. It isn’t a very complicated data structure, but it is quite useful, and not often discussed in college data structures courses (as they aren’t particularly interesting, from a mathematical perspective).

The present structure supports checking the status of a bit, as well as setting and unsetting bits in an idempotent manner, all in constant time. It is fixed size, but that would be pretty easy to adjust by using a dynamic array type, rather than a fixed one, for storing the bytes.

Some other possible extensions to this structure would be clustering bits into logical groups. For example, you may want to have each index point to 3 bits of status information, rather than just 1. Or perhaps you want to be able to bulk operate on chunks of the map, or unset the entire thing in one call. Once you understand the basics (outlined above) of how bit arrays work, all of these operations are pretty straightforward to implement, and I’d encourage you to try your hand at adding them in.

A Note on the C++ Standard Library

As a final note, a lot of languages have a bit array of some sort in their standard library. One particular tricky case, which I want to mention here, is C++. A quick glance at the documentation for the standard library turns up the bitset type, which looks and feels a lot like the bit array we just created. Be advised, however, that there is a fundamental difference: the size of a bitset must be known at compile time.

If you need to construct it at runtime instead, then you should use the boolean specialization of the standard vector type, std::vector<bool>. This datatype compacts the booleans to take up a single bit, as we did here. Annoyingly, this makes the std::vector<bool> magically behave differently from all the other versions of std::vector, but I suppose this isn’t the place to rant about the C++ standard library and its idiosyncrasies. If you’re using it, and you want a bit array, that is the structure to use.


  1. If you already know the bitwise operations, you may think that the solution is one that we haven’t discussed yet, the bitwise-exclusive or (xor, often the ^ operator). This won’t actually do what we want it to though. The bitwise xor will place a 1 in its output where there is a 1 in either of the inputs, but not in both, and a 0 anywhere else. This will work to unset a bit that has been set, but it will fail our idempotence requirement, because applying it to a bit that is already unset will set it! Effectively, xor will function like a toggle–potentially useful, but not what we want here. ↩︎

Related