Data Structures: The Bit Array
A common task that you will need to accomplish in programming is to represent a large array of states. For example, you may want to tag each element in a list with a "status" bit to indicate some information about that record. Perhaps whether it is deleted or not, or whether it has been loaded from disk or not, etc. The obvious approach to accomplish this would be to attach a boolean to each element within the array.
For example, consider the following Python implementation,
class array_element:def __init__(self, value):self.value = valueself.deleted = falsedef delete(self):self.deleted = truex = [array_element(1), array_element(2), array_element(3), ...]
In this code, we could delete an element by calling its delete method, and
thus setting its deleted attribute to true. Code processing the array, then,
can recognize it as deleted and handle it appropriately.
This approach has many virtues, but it does couple the data (value)
with the metadata (deleted) into a single object. We could instead
separate the data from the metadata by maintaining two different arrays,
x = [1, 2, 3, ...]x_deleted = [False, False, False, ...]
which greatly simplifies the code. Now we can mark an element in x as deleted
by assigning True to its corresponding index within x_deleted.
# delete element ix_deleted[i] = True# check if element i is deletedif x_deleted[i]:# x[i] has been deleted
Bundle these two arrays together into an object to abstract away the details of manipulating them, and we're good to go!
Except, you may notice that there is a bit of inefficiency in this process. A boolean value carries only 1 bit of information (true/false, 0/1, on/off, high/low, etc.), but in many programming languages requires a lot more than 1 bit of memory to represent. This is necessary when you're working with individual booleans, because computer memory isn't actually bit-addressable. But, if we're packing a whole bunch of booleans together, we may be able to compact them into a more space-efficient representation, where each element requires only a single bit of storage. This is the basic idea of a bit array.
Bitwise Data Manipulation
Before we can dive into how to construct a bit array, we'll first need to address a few low-level operators that most programming languages implement for manipulating binary data directly. These are often called bitwise operators, and there are four in particular that we will use in our bit array implementation. These are,
- bitwise AND
- bitwise OR
- bitwise left shift
- bitwise complement
Bitwise AND (often the & operator) is a binary operation which accepts two
numbers. It compares them, bit-by-bit, and creates a new number where each
bit is the result of an AND operation between the corresponding two bits
in the input (that is, the i'th bit in the output will be 1 if and only if
the i'th bit in both inputs is 1). As an example, consider 0b00111100 & 0b01010100.
I'll line them up so you can easily see the bit-correspondences,
0b00111100 (first operand) & 0b01010100 (second operand) 0b00010100 (result)
Bitwise OR (often the | operator) works the same way, except it takes the
OR of each pair of bits. The i'th output bit will be 1 if and only if either
(or both) of the i'th bits in the operands are 1. We'll do the bitwise OR
of the same two numbers: 0b00111100 | 0b01010100
0b00111100 (first operand) | 0b01010100 (second operand) 0b01111100 (result)
The way I like to remember these is as follows,
A bit in the output of a bitwise AND will be 1 if the corresponding bit in both the inputs is 1, and 0 otherwise. A bit in the ouput of a bitwise OR will be 0 if the corresponding bit in both of the inputs is 0, and 1 otherwise.
The bitwise left shift (often the << operator) is a binary
operator that takes two numbers. The first is a sequence of bits to
be shifted, and the second is the number of shifts to perform. The
basic idea of the left shift is that, for each shift, a 0 is added
on the right-hand side of the number (the least significant bit),
and the most significant bit is discarded, to ensure that the total
number of bits stays the same. For example, 0b10010011 << 1,
<< 1 0b10010011 (input)
0b00100110 (output [one shift])
Finally, the bitwise complement (often the ~ operator) is a unary operator
that takes a single number and inverts it. All of its 0s become 1s, and vice
versa. For example, ~0b00110011:
~ 0b00110011 (input) 0b11001100 (output)
Applications the Bitwise Operators
If it isn't immediately apparent how these operators will be useful to us, let's map them directly onto useful functions that we will need in order to manage our bit array. We will need to support the following operations at a bit level,
- Test whether a bit is set
- Set a bit (make it a 1)
- Unset a bit (make it a 0)
For the sake of convenience in our implementation, let's also require that the set and unset operations are idempotent. This means that repeatedly applying them will not alter the result compared to only applying them once. If a bit is already set, then applying the set operation should not change anything. Likewise, if a bit is already unset, then applying the unset operation shouldn't change anything.
Let's start by considering a bit array of only a single byte (represented by,
say, the char datatype in C or C++). Later we'll see how we can construct an
arbitrarily large bit array.
Testing a Bit
We can use the bitwise AND operator to test if a bit is set. Construct a byte which contains a 1 in only the bit of interest, and AND it with the byte we want to test. We'll have two cases,
- The bit is set
- The bit is not set
0b11001010 & 0b00001000 0b00001000
See the non-zero result of the operation? Were the bit we were testing to not be set,
0b11000010 & 0b00001000 0b00000000
the result will be 0. So this suggests a pretty simple test for the status of a bit within a byte using bitwise AND,
if ( byte & test_mask ) then--the bit is setelse--the bit is not setend
where byte is the bit array, and test_mask is a byte that has been
constructed to have a single 1 bit, at the index we want to test.
Setting a Bit
Likewise, we can use a bitwise OR to set an arbitrary bit,
0b11000010 | 0b00001000 0b11001010which suggests the following solution to set a bit within our byte,
byte = byte | set_mask
where byte is the bit array and set_mask is a byte that has been
constructed to have a single 1 bit, at the index we want to set
Note that this set operation is idempotent. If we repeatedly execute the above
code with the same set_mask, the end result will remain the same as if we had
executed it one time.
Unsetting a Bit
While setting a bit is pretty straightforward, unsetting is a bit tricker. What we want is an operation that will create a 0 in the output if and only if both of the input bits are 1. We don't have an operation in our toolkit that will do this directly. [1]
The trick here is to use the bitwise complement on the
mask first. To demonstrate, say we want to unset the 5th
bit of our bit array, then our mask and its complement are,
0b00010000 (mask) 0b11101111 (complement)
If we then AND this complement with the bit array, the result will be a byte which has all of the same bits set as the original, with the exception of the bit we are trying to unset. So our unset procedure is then,
byte = byte & (~unset_mask)
Constructing the Mask
We now have code for performing all of the main operations we need, however these operations are all defined in terms of a mask--a special byte that has a single 1 bit in the correct spot. So we are going to need some way to create this mask, as well.
As it turns out, this is simple enough to do using a bitwise left shift. The
number 1, in binary, is 0b00000001. So if we want to test the fifth bit,
we can simply take our 1 and shift it four times,
0b00000001 (1) 0b00010000 (1 after << 4)
So the creation of the mask is pretty straightforward,
mask = 1 << index
will give us the mask corresponding to a given bit, by index. Note that this works if we number the bits from 0 to 7. If you'd rather use 1-8, then you need to subtract 1 from index first, before shifting.
Single-byte Bit Array Implementation
With these operations in mind, we can now create a simple implementation of a single byte bit array. Note that, as this array contains only a single byte, it can only store 8 values. In the next section we'll discuss expanding this to support arbitrary lengths. The implementation here is given in C, in a procedural style. Note that I haven't included any error handling, so this code isn't "production-ready", it's just a simple example.
typedef struct bitarray {char data;} bitarray;void create_bitarray(){bitarray *array = malloc(sizeof(bitarray));array->data = 0; // initialize to 0 explicitly (all bits unset)return array;}void set_bit(bitarray *array, size_t index){char mask = 1 << index;array->data = array->data | mask;}void unset_bit(bitarray *array, size_t index){char mask = 1 << index;array->data = array->data & (~mask);}bool test_bit(bitarray *array, size_t index){char mask = 1 << index;return array->data & mask;}void delete_bitarray(bitarray *array){free(array);}
Multi-byte Bit Arrays
Okay, so we now have a working single-byte bit array. Can we extend this to support an arbitrary number of bytes? Sure! It isn't even all that hard.
We'll still use a byte datatype to store our bits, but rather than just a single byte, we'll use an array of them. This means that, when we go to access a given bit, we need to figure out which byte contains that bit, and then which bit within the byte we want. So we will need to convert our single index (I want bit 20), into two different indices (byte index and bit index).
The trick is to note that each byte contains 8 bits. So we can get the byte number by simply dividing (integer division, mind, Python3 users beware) the input index by 8. In the case of 20 mentioned earlier then, we have as our byte index. To get the bit within that byte, we then need to calculate the remainder of that same division, . So the bit at index 20 is located in the bit at index 4 within the byte at index 2. In code terms, this gives us,
byte_index = index / 8bit_index = index % 8
The byte index tells us which byte to use, and then the bit index is what we use to calculate the mask.
Note then that this also means that when we go to allocate the array of bytes to represent our bit array, we will end up wasting a little space. This is because we need to allocate space in chunks of 8 bits. So, if we wanted an array that would store 21 bits, we will actually need to allocate 24 bits (3 bytes), , to ensure we have enough room. This is a wastage of 3 bits, but that is nothing compared to the wastage that would have accompanied creating an array of 21 booleans (each of which could be as large as 4 bytes apiece!)
In any case, accounting for all of this allows us to construct a multi-byte bit array. Again, the following code is in C and lacks necessary error handling for simplicity,
typedef struct bitarray {char data*;size_t logical_size;size_t physical_size;} bitarray;bitarray *create_bitarray(size_t size){bitarray *array = malloc(sizeof(bitarray));bitarray->logical_size = size;bitarray->physical_size = ceil((double) size / 8.0);array->data = calloc(array->physical_size);return array;}void set_bit(bitarray *array, size_t index){size_t byte_index = index / 8;size_t bit_index = index % 8;char mask = 1 << bit_index;array->data[byte_index] = array->data[byte_index] | mask;}void unset_bit(bitarray *array, size_t index){size_t byte_index = index / 8;size_t bit_index = index % 8;char mask = 1 << bit_index;array->data[byte_index] = array->data[byte_index] & (~mask);}bool test_bit(bitarray *array, size_t index){size_t byte_index = index / 8;size_t bit_index = index % 8;char mask = 1 << bit_index;return array->data[byte_index] & mask;}void delete_bitarray(bitarray *array){free(array->data);free(array);}
Conclusion
And there you have it, all the basic building blocks needed to put together your own bit array. It isn't a very complicated data structure, but it is quite useful, and not often discussed in college data structures courses. Our structure supports checking the status of a bit, as well as setting and unsetting bits in an idempotent manner, all in constant time. It is fixed size, but that would be pretty easy to adjust by using a dynamic array type, rather than a fixed one, for storing the bytes.
Some other possible extensions to this structure would be clustering bits into logical groups. For example, you may want to have each index point to 3 bits of status information, rather than just 1. Or perhaps you want to be able to bulk operate on chunks of the map, or unset the entire thing in one call. Once you understand the basics of how bit arrays work, all of these operations are pretty straightforward to implement, and I'd encourage you to try your hand at adding them in.
A Note on the C++ Standard Library
As a final note, a lot of languages have a bit array of some sort
in their standard library. One particular tricky case, which I want
to mention here, is C++. A quick glance at the documentation for the
standard library turns up the bitset type, which looks and
feels a lot like the bit array we just created. Be advised, however,
that there is a fundamental difference: the size of a bitset
must be known at compile time.
If you need to construct it at runtime instead, then you
should use the boolean specialization of the standard vector type,
std::vector<bool>. This datatype compacts the booleans
to take up a single bit, as we did here. Annoyingly, this makes the
std::vector<bool> magically behave differently from
all the other versions of std::vector, but I suppose
this isn't the place to rant about the C++ standard library and its
idiosyncrasies. If you're using it, and you want a bit array, that is
the structure to use.