thecodingidiot.com

Building CWriting Sort

Writing Sort

We need a sort utility. The spec is short: take a filename, read its lines, sort them alphabetically, print them. Five tools later we will have a clean version of this. Right now we just want it to work.

Open sort.c in your editor and follow along. The first version is a single file, top to bottom.

The skeleton

Every C program starts the same way: includes, then main. Sort needs three standard headers:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

<stdio.h> for printing and reading files, <stdlib.h> for dynamic memory and sorting, <string.h> for string handling.

Now main. We have not used argc or argv before — these are how a program receives its command-line arguments. argv is an array of strings: argv[0] is the program name, argv[1] is the first thing the user typed after it, and so on. argc is how many of those entries there are.

int main(int argc, char **argv)

char ** is a pointer to a pointer to a char. For now, read it as the C syntax for array of strings — each string is a char *, and the array holding them is char **. The c-tier covers what pointers really are. Right here, argv is just a list of strings.

If the user did not pass a filename, complain and exit:

if (argc < 2) {
    fprintf(stderr, "usage: %s <file>\n", argv[0]);
    return (1);
}

fprintf(stderr, ...) writes to standard error rather than standard output. Error messages belong on stderr so they do not get mixed into the program's actual output, which is going to stdout.

Opening the file

fopen opens a file and returns a FILE * — a pointer the rest of the standard I/O functions use to know which file we mean. The second argument is the mode; "r" is read.

in = fopen(argv[1], "r");

That is it for setup. Onward.

Reading the lines

We need somewhere to put the lines as we read them. Sort cannot know how many lines a file has until it has read them all, so a fixed-size array will not do. We need a dynamic array — one that grows as we add to it.

The C standard library has three calls for this:

  • malloc(n) allocates a fresh n-byte block of memory and returns a pointer to it.
  • realloc(p, n) resizes an existing block to n bytes.
  • free(p) releases a block when you are done with it.

All three live in <stdlib.h>. They are the heart of dynamic memory in C, and the c-tier covers them in detail. For sort we just need to know what they do.

For sort, the dynamic array is an array of strings — char **, the same syntax argv uses. We will start it empty and grow it as we read. We also need a fixed-size buffer to read each line into before copying it somewhere permanent.

C wants every variable declared at the top of the function, so we collect all of main's variables — including the FILE *in we used in the line above — into one block. size_t is an unsigned integer type from <stddef.h> used for sizes and array indices.

FILE    *in;
char    **lines;
char    *copy;
size_t  count;
size_t  capacity;
size_t  new_capacity;
size_t  i;
char    buf[4096];
 
lines = NULL;
count = 0;
capacity = 0;

count is how many lines we have so far. capacity is how much room the array currently has. lines is the array itself, initially nothing — NULL is C's "this pointer points at no memory yet."

Before the code, the shape of the algorithm:

The read loop uses fgets, which reads one line at a time:

while (fgets(buf, sizeof(buf), in)) {
    buf[strcspn(buf, "\n")] = '\0';

fgets returns NULL when the file ends, so the while exits naturally. The line it reads keeps its trailing newline, and we do not want that in the sorted output, so we strip it: strcspn finds the offset of the first newline in buf, and we overwrite that newline with \0 (the C string terminator).

Inside the loop, grow the array if it is full, then copy the line into a fresh allocation:

    if (count == capacity) {
        if (capacity == 0)
            new_capacity = 16;
        else
            new_capacity = capacity * 2;
        lines = realloc(lines, new_capacity * sizeof(char *));
        capacity = new_capacity;
    }
    copy = malloc(strlen(buf));
    strcpy(copy, buf);
    lines[count++] = copy;
}

Doubling the capacity each time the array fills is a standard trick — it keeps the amortised cost of growing the array small. realloc(NULL, ...) is equivalent to malloc, which is why starting lines as NULL and reusing the same line works for the very first allocation.

malloc(strlen(buf)) reserves a fresh block the size of the line. strcpy(copy, buf) copies the line text into it. lines[count++] = copy stores the new pointer at the end of the array, then advances count by one. lines[i] is C's array indexing — the same syntax you would use on a fixed-size array.

Once the loop exits, close the file:

fclose(in);

Sorting

The standard library has qsort for sorting any array. It takes four arguments: the array, the number of elements, the size of each element, and a comparison function. The comparator takes two const void * arguments and returns negative, zero, or positive depending on whether the first should sort before, equal to, or after the second.

For sorting strings, the comparator is short — call strcmp and return what it returns:

static int cmp_strings(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return (strcmp(*sa, *sb));
}

The array elements are char * — strings. qsort does not know that; it just hands the comparator two generic pointers and asks for an order. The two-line cast in cmp_strings tells the compiler what those pointers really point at, and *sa and *sb then read the strings out — the * operator means "the thing this pointer points at." Once we have the two strings, strcmp does the actual comparison.

That is more new C than the rest of the file put together: the * operator, void *, function pointers. The c-tier comes back to all three. Right here, the shape above is the standard pattern for sorting strings in C and works the same way every time. Type it, move on.

The call itself is one line:

qsort(lines, count, sizeof(char *), cmp_strings);

Printing and cleaning up

Print every line to stdout:

for (i = 0; i < count; i++)
    printf("%s\n", lines[i]);

lines[i] is the string at position i in the array — the same indexing as any array. We added the \n back because we stripped the original.

And free what we allocated:

free(lines);
return (0);

The whole thing

Put it all together:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
 
static int cmp_strings(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return (strcmp(*sa, *sb));
}
 
int main(int argc, char **argv)
{
    FILE    *in;
    char    **lines;
    char    *copy;
    size_t  count;
    size_t  capacity;
    size_t  new_capacity;
    size_t  i;
    char    buf[4096];
 
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return (1);
    }
    in = fopen(argv[1], "r");
    lines = NULL;
    count = 0;
    capacity = 0;
    while (fgets(buf, sizeof(buf), in)) {
        buf[strcspn(buf, "\n")] = '\0';
        if (count == capacity) {
            if (capacity == 0)
                new_capacity = 16;
            else
                new_capacity = capacity * 2;
            lines = realloc(lines, new_capacity * sizeof(char *));
            capacity = new_capacity;
        }
        copy = malloc(strlen(buf));
        strcpy(copy, buf);
        lines[count++] = copy;
    }
    fclose(in);
    qsort(lines, count, sizeof(char *), cmp_strings);
    for (i = 0; i < count; i++)
        printf("%s\n", lines[i]);
    free(lines);
    return (0);
}

Compile and run it on a tiny input:

printf 'banana\ncherry\napple\n' > test.txt
gcc sort.c -o sort
./sort test.txt

You should see:

apple
banana
cherry

Looks fine. Ship it.