The Hidden Magic of ASCII Tables


Starting my CLI tool multask, I needed a way to show data in the terminal. Keeping in line with my no-libraries restriction (except libc and Win32), I hand-rolled my own terminal table that refreshes every second with new data. It turned out to be a massive pain in the neck, and it made me realise how complex and clever seemingly simple tables in the terminal can be.

Let's start with just printing out a simple table: every row can be a new line, each column can be the width of the longest string in its rows (we'll get back to this!), and we can wrap the whole thing in ASCII border characters such as '+' for the corners, '-' for the top and bottom, and '|' for the left and right. I won't put a code example here because it's quite verbose and easy to make; this blog post is more about the nuances of terminal tables. But here's what the output would look like:

+----+-----------+----------------+---------------------+-----+---------+--------+-----+---------+
| id | namespace | command        | location            | pid | status  | memory | cpu | runtime |
+----+-----------+----------------+---------------------+-----+---------+--------+-----+---------+
| 1  | N/A       | echo hi        | F:\Dev\Apps\multask | N/A | Stopped | N/A    | N/A | N/A     |
| 2  | N/A       | node server.js | F:\Dev\Apps\server  | N/A | Stopped | N/A    | N/A | N/A     |
+----+-----------+----------------+---------------------+-----+---------+--------+-----+---------+

Now to clear the table!

Most terminals support something called ANSI escape codes, a super clever way of manipulating the terminal and turning it into a properly editable environment.

The one I'm interested in is this right here:

\x1b[A\x1b[2K

If you print this into your terminal, it moves the cursor up one line (\x1b[A) and then erases that line (\x1b[2K):

printf("\x1b[A\x1b[2K");

This is perfect for refreshing data: all I need is a timer where, every second, I print that escape code once per row, update the table's data, and reprint it! A foolproof plan… Almost.

The table runneth over

Uh oh!! Turns out that when my terminal isn't full screen, the table wraps around, so one row becomes two terminal lines and only half of the table gets erased! This needed fixing. I'm making my program for terminals of all shapes and sizes, so how am I going to do this?

What we can do is dip our toes into the low-level world of our operating system and use libc or the Win32 API (I'm choosing Win32 because it's what I'm on at the moment), get the number of columns the terminal has, and do a little bit of math to see how many terminal lines one table row will take, then erase that many lines instead of one.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

int row_to_terminal_lines(int string_width, int window_cols) {
    float overlap = (window_cols < string_width) ?
      (float) string_width / window_cols : 1;
    return ceil(overlap);
}

int get_terminal_columns() {
    HANDLE out_handle = GetStdHandle(STD_OUTPUT_HANDLE);
    CONSOLE_SCREEN_BUFFER_INFO binfo;
    if (GetConsoleScreenBufferInfo(out_handle, &binfo) == 0) {
        printf("Error while getting console info.\n");
        exit(1);
    }
    return binfo.srWindow.Right - binfo.srWindow.Left + 1;
}

int main() {
    int cols = get_terminal_columns();
    printf(
        "terminal columns = %d\nrows printed = %d\n",
        cols, row_to_terminal_lines(120, cols)
    );
}

Once again, a little harder but not too much to handle.

But what about Chinese?

Uni-God dammit

You see this -> 吃 ? On terminals it takes up the width of two characters, but it's made up of 3 bytes, which completely throws our way of getting the length of the longest string in the BIN!

I decided to take a break from that part of the code and just let it fester in my mind: there's a slim chance a Chinese person tries my program, made with love and care, and can't get the tables to work because he was born in the wrong place with the wrong language, and he cries out against the world, all because of this tedious character system that I don't understand.

And then I took the time to understand it. Well enough to get by and make something for that poor imaginary person in my head.

Let's start with code points: when working with non-wide strings (arrays of u8s instead of u16s), a single Unicode character may be split across multiple bytes, which together encode a single number called a code point. It's a key to look up a character in the Unicode table, and from this key we can tell what the character will look like. For example, if the code point's value is between 128,512 and 128,591, you know it's an emoji, which will have the width of 2 characters.

This is the key to figuring out which code points are going to be wide: if we know certain ranges of values are wide, we can check whether any code point falls within those ranges. Fingers crossed this is easy!

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct Utf8Iterator {
    unsigned int i;
    unsigned char* bytes;
};

int utf8ByteSequenceLength(unsigned char byte) {
    if (byte >= 0b11110000 && byte <= 0b11110111) {
        return 4;
    } else if (byte >= 0b11100000 && byte <= 0b11101111) {
        return 3;
    } else if (byte >= 0b11000000 && byte <= 0b11011111) {
        return 2;
    } else if (byte >= 0b00000000 && byte <= 0b01111111) {
        return 1;
    } else {
        printf("Invalid char\n");
        exit(1);
    }
}

int nextCodepointSlice(struct Utf8Iterator* iter) {
    size_t len = strlen((char*) iter->bytes);
    if (iter->i >= len) {
        return -1;
    }

    int code_point_len = utf8ByteSequenceLength(iter->bytes[iter->i]);
    int old_idx = iter->i;
    iter->i += code_point_len;

    return old_idx;
}

int utf8Decode2(unsigned char* bytes) {
    int value = bytes[0] & 0b00011111;

    if ((bytes[1] & 0b11000000) != 0b10000000) {
        printf("Utf8ExpectedContinuation\n");
        exit(1);
    }
    value <<= 6;
    value |= bytes[1] & 0b00111111;

    if (value < 0x80) {
        printf("Utf8OverlongEncoding\n");
        exit(1);
    }

    return value;
}

int utf8Decode3AllowSurrogateHalf(unsigned char* bytes) {
    int value = bytes[0] & 0b00001111;

    if ((bytes[1] & 0b11000000) != 0b10000000) {
        printf("Utf8ExpectedContinuation\n");
        exit(1);
    }

    value <<= 6;
    value |= bytes[1] & 0b00111111;

    if ((bytes[2] & 0b11000000) != 0b10000000) {
        printf("Utf8ExpectedContinuation\n");
        exit(1);
    }

    value <<= 6;
    value |= bytes[2] & 0b00111111;

    if (value < 0x800) {
        printf("Utf8OverlongEncoding\n");
        exit(1);
    }

    return value;
}

int utf8Decode3(unsigned char* bytes) {
    int value = utf8Decode3AllowSurrogateHalf(bytes);

    if (0xd800 <= value && value <= 0xdfff) {
        printf("Utf8EncodesSurrogateHalf\n");
        exit(1);
    }

    return value;
}

int utf8Decode4(unsigned char* bytes) {
    int value = bytes[0] & 0b00000111;

    if ((bytes[1] & 0b11000000) != 0b10000000) {
        printf("Utf8ExpectedContinuation\n");
        exit(1);
    }
    value <<= 6;
    value |= bytes[1] & 0b00111111;

    if ((bytes[2] & 0b11000000) != 0b10000000) {
        printf("Utf8ExpectedContinuation\n");
        exit(1);
    }
    value <<= 6;
    value |= bytes[2] & 0b00111111;

    if ((bytes[3] & 0b11000000) != 0b10000000) {
        printf("Utf8ExpectedContinuation\n");
        exit(1);
    }
    value <<= 6;
    value |= bytes[3] & 0b00111111;

    if (value < 0x10000) {
        printf("Utf8OverlongEncoding\n");
        exit(1);
    }

    if (value > 0x10FFFF) {
        printf("Utf8CodepointTooLarge\n");
        exit(1);
    }

    return value;
}

int utf8Decode(unsigned char* bytes, int len) {
    switch (len) {
        case 1:
            return bytes[0];
        case 2:
            return utf8Decode2(bytes);
        case 3:
            return utf8Decode3(bytes);
        case 4:
            return utf8Decode4(bytes);
        default:
            printf("Bad length %d\n", len);
            exit(1);
    }
}

int nextCodepoint(struct Utf8Iterator* iter) {
    int start_idx = nextCodepointSlice(iter);
    if (start_idx == -1) {
        return -1;
    }
    unsigned char* code_point = iter->bytes + start_idx;
    int code_point_len = iter->i - start_idx;
    return utf8Decode(code_point, code_point_len);
}

int unicodeWidth(int code_point) {
    // C0 && DEL
    if (code_point == 0) return 0;
    if (code_point < 32 || (code_point >= 0x7f && code_point < 0xa0)) return 0;

    // Wide || Fullwidth ranges (based on wcwidth.c && Unicode TR11)
    if ((code_point >= 0x1100 && code_point <= 0x115F) ||
        code_point == 0x2329 || code_point == 0x232A ||
        (code_point >= 0x2E80 && code_point <= 0xA4CF && code_point != 0x303F) ||
        (code_point >= 0xAC00 && code_point <= 0xD7A3) ||
        (code_point >= 0xF900 && code_point <= 0xFAFF) ||
        (code_point >= 0xFE10 && code_point <= 0xFE19) ||
        (code_point >= 0xFE30 && code_point <= 0xFE6F) ||
        (code_point >= 0xFF00 && code_point <= 0xFF60) ||
        (code_point >= 0xFFE0 && code_point <= 0xFFE6) ||
        (code_point >= 0x1F300 && code_point <= 0x1F64F) ||
        (code_point >= 0x1F900 && code_point <= 0x1F9FF) ||
        (code_point >= 0x20000 && code_point <= 0x3FFFD))
    {
        return 2;
    }

    return 1;
}

int main() {
    unsigned char str[16] = "hello你好👋";
    unsigned char* test_str = str;
    struct Utf8Iterator iter;
    iter.i = 0;
    iter.bytes = test_str;

    int code_point = nextCodepoint(&iter);
    while (code_point != -1) {
        int width = unicodeWidth(code_point);
        printf("TERMINAL WIDTH: %d\n", width);

        code_point = nextCodepoint(&iter);
    }
}

:(

Ah well.

If you run this, you'll see each character in the `hello` part of the string has a terminal width of 1, but 你, 好 and 👋 each have a width of 2.

But let's go through it anyway — the iteration through the code points is a C interpretation of the way Zig does it. To start with, we have our iterator:

struct Utf8Iterator {
    unsigned int i;
    unsigned char* bytes;
};

The `i` is the index of our position in `bytes`, which is our text. To find the code points we have our `nextCodepoint` function: we pass in the iterator, it looks at the first byte to find how many bytes make up the next code point, and it combines those bytes into a 32-bit integer. That integer is then checked against a large group of if statements to see whether it falls within certain ranges for emojis, East Asian characters, etc. If it's in one of those ranges, it gets tagged as a wide Unicode character with a terminal width of 2; otherwise it stays a regular character with a width of 1.

This code lets us work out the true display length of a string with Unicode characters in it. This should be the end… Right?

ANSI Escape Codes II

I hope you haven't forgotten about these. Turns out the ticket to properly refreshing my tables came with another huge thorn in the side.

Another cool property of these escape codes is that they can colour text! I've been using them to colour the status of each process in my table: green for Running, red for Stopped. The only issue, as I later found out, is that ANSI codes use multiple characters to tell the terminal not to treat the 'command' as regular text, which means that to colour a string yellow, it can look like this:

\x1b[38;2;253;182;0mTHIS IS YELLOW\x1b[0m

Don't get me wrong, this is extremely cool. However, as you can see, I can't just check whether a single character is within a certain range and count it as 0 length like I did with the wide characters. If I have a yellow string in my table, the column would be far wider than the actual length of the inner text THIS IS YELLOW. I need a way of checking for sequences of characters to add onto my unicode work…
// ... all the previous functions from above

enum IsAnsiCodeReturn {
    YES,
    YES_OSC, // OSC is a specific subset of ANSI escape codes
    NO,
    NO_SKIP
};

// Returns whether the character is within an ANSI sequence
// so as to not count it.
enum IsAnsiCodeReturn isAnsiCode(int first, enum IsAnsiCodeReturn state) {
    if (state == NO) {
        if (first == 0x1B) {
            return YES;
        }
        return NO;
    }

    if (first == '[') {
        return YES;
    }

    if (first == ']') {
        return YES_OSC;
    }
    if (state == YES_OSC) {
        if (first == 0x07) {
            return NO_SKIP;
        }
        return YES_OSC;
    }

    if (first < 0x40 || first > 0x7E) {
        return YES;
    }

    if (state == YES) {
        return NO_SKIP;
    }
    return NO;
}

// new getLength function with ANSI
int getLength(unsigned char* str) {
    struct Utf8Iterator iter;
    iter.i = 0;
    iter.bytes = str;
    int length = 0;

    int code_point;
    enum IsAnsiCodeReturn res = NO;
    while ((code_point = nextCodepoint(&iter)) != -1) {
        res = isAnsiCode(code_point, res);
        // Anything inside (or terminating) an escape sequence has no width.
        if (res != NO) {
            continue;
        }
        length += unicodeWidth(code_point);
    }
    return length;
}

And now some tests for what it works on:

int main() {
    unsigned char str[16] = "hello你👋";
    int length_wide = getLength(str);
    printf("Wide text: %d\n", length_wide);

    unsigned char ansi_str[8] = "\x1b[A\x1b[2K";
    int length_ansi = getLength(ansi_str);
    printf("Erase line ANSI: %d\n", length_ansi);

    unsigned char ansi_str_colour[30] = "\x1b[38;2;255;255;255mHELLO\x1B[0m";
    int length_ansi_colour = getLength(ansi_str_colour);
    printf("Colour text ANSI: %d\n", length_ansi_colour);

    unsigned char osc_str_link[128] = "\x1b]8;;https://example.com\x07Link\x1b]8;;\x07";
    int length_osc_colour = getLength(osc_str_link);
    printf("Link OSC: %d\n", length_osc_colour);
}

The results from running this are:

Wide text: 9
Erase line ANSI: 0
Colour text ANSI: 5
Link OSC: 4

Which is correct! It still doesn't mess with Unicode characters: the text hello你👋 has 5 regular-width characters and 2 wide characters, giving a length of 9. The erase-line sequence erases a line in the terminal and has no width when printed, giving a length of 0. The coloured text is what I mostly cared about, since that's what actually gets printed in a table: the text HELLO is 5 characters long, which matches, and the rest of the sequence is ignored. Finally, OSC escape codes have to be treated separately as they don't follow the rules of regular ANSI codes, but our code accommodates this: it embeds the URL https://example.com into the text Link to act as a hyperlink, and our code reads it as 4 characters long, which it is.

And this is the reasonable point where I stopped my string-length shenanigans. After all, I'm only writing an ASCII table; those should be simple… Right?

The code above was my conversion of the Zig implementation into C. I'm not a C expert, so if you spot any bugs or errors, please let me know!

Thanks for reading!

