Lab 3: A little data representation

This lab will take place in two parts. First, we’ll look at character encodings, and then move on to exploring how integers are stored.

There will likely be parts of this lab that you don’t understand right away. Keep reading, keep experimenting, collect your questions, and ask them! Gradually, things will start to make more sense.

Part 1: Character encodings

Do the following. Remember to collect questions, play around, and take notes. After about 20 minutes, we’ll discuss.

  • Connect to mantis in VS Code.

  • Create a new text file named something.txt. Type two or three short lines of ASCII text (e.g., just a few English words) and save your file.

  • In the VS Code terminal, make sure you are cd’d into the directory containing something.txt and run this command:

    hexdump -C something.txt
    
  • Do the hex values you see in your file correspond to the characters you entered? Do the characters come in the order you expect? Do you see any newline characters? (You may find it helpful to open an ASCII chart in a browser tab or view the chart in a terminal by running man ascii.)

  • Copy the word “résumé” and the Greek letters “αβγδ” from this page into something.txt. Note that é and the Greek letters are not ASCII characters. Save and run hexdump -C something.txt again. Which bytes correspond to é, α, β, γ, and δ?

Now, let’s explore some alternate encodings.

  • On the bottom-right status bar of VS Code, click on UTF-8, select Save with encoding, and then select UTF-16 LE.

  • Run hexdump again. What changed? Which bytes correspond to which characters?

  • Again at the bottom right of VS Code, click on UTF-16 LE, select Save with encoding, and then select UTF-16 BE.

  • Run hexdump again. What changed? Which bytes correspond to which characters? How is this different from UTF-8? What about UTF-16 LE?

  • You are hopefully starting to make some sense of the difference between Unicode code points and the ways those code points are encoded as bytes in formats like UTF-8, UTF-16 LE, and UTF-16 BE. If any of this is still fuzzy, do a little internet exploration (after class) to figure out the differences between them.

Part 2: Integers

For this part, you’ll again work through a set of instructions, and answer some questions by writing C code.

a) Representing integers

  • Grab a copy of integer_rep.c and save it in your mantis working directory.

  • Read it, predict what it will do, and run it:

    gcc -Wall -Werror -o integer_rep integer_rep.c
    ./integer_rep > output.txt
    
  • Display the contents of output.txt using hexdump. How does what the C program did correspond to what you see in the output file?

  • As an aside, what did the > symbol do in the command above?

  • In integer_rep.c, change j = 25 to j = -25, save, recompile, rerun, and check the output again with hexdump. What changed? Why did it change exactly like that?

b) A handy tool

If you want to know the exact bits contained in an int (displayed as hexadecimal digits, four bits per digit), do this:

int j = 314;
printf("0x%08X\n", j);

It gets slightly weirder for long (note the l before the X):

long k = 314159;
printf("0x%016lX\n", k);

It gets even weirder for char, as we’ll explore shortly.

c) Some questions

Take some time to try using C’s sizeof operator (it isn’t actually a function, but it behaves enough like one that we’ll pretend it is) to answer the following questions.

  • How many bytes are in an int?

  • How many bytes are in a long?

  • How many bytes are in a char?

  • How many bytes are in an unsigned int?

  • How many bytes are in an unsigned long?

  • How many bytes are in an unsigned char?

  • If you do this:

    int j = -1;
    unsigned k = -1;
    

    what bits are in j? What about k?

  • What is going on here?!?

    char c1 = 0x41;
    printf("c1 as char: %c\n", c1);
    printf("c1 as decimal int: %d\n", c1);
    printf("c1 as hexadecimal int: %X\n", c1);
    
    char c2 = 0xCE;
    printf("c2 as char: %c\n", c2);
    printf("c2 as decimal int: %d\n", c2);
    printf("c2 as hexadecimal int: %X\n", c2);
    
  • What about this one?!?

    int s = -1;
    int t = (s >> 4);
    printf("s (-1): 0x%08X\n", s);
    printf("s >> 4: 0x%08X\n", t);
    
  • Do the same thing as before, but with s and t declared as unsigned. Before you run it, make a prediction as to whether it will be different or the same. Was your prediction correct?