HTTP
In the previous chapter, I mentioned the large number of application protocols that exist. You probably make use of many such protocols when you use the Internet, but in this chapter we’re going to focus on the king of application protocols: the HyperText Transfer Protocol (HTTP).
Remember that thanks to the transport and Internet layers, we don’t have to worry at all about the technical details we discussed in the previous chapters. When discussing the application layer, we can pretend that computers magically send formatted data to each other just as easily as you might talk to another person in the same room as you.
You might know good ol’ HTTP from your browser’s address bar, where it is often seen just ahead of the domain name. That’s because HTTP is the backbone of the World Wide Web: the interlinked multimedia web pages you view in your web browser.
Text
Before we dig into HTTP, we need a quick aside to discuss data formats. Recall that one thing a protocol must settle is the data format used for communication. HTTP is a text-based protocol, meaning that much of its communication is in the form of human-readable text. But of course we need a binary format for storing that text in a computer. One such format is called ASCII.
ASCII is a simple text format where each byte represents a single character.
What does “character” mean here?
A character is a single textual symbol. For example, upper and lower case letters and punctuation symbols are all characters.
Here is a table translating between hexadecimal byte values and ASCII characters:
Hex | Char | Hex | Char | Hex | Char | Hex | Char | Hex | Char | Hex | Char |
---|---|---|---|---|---|---|---|---|---|---|---|
20 | (space) | 30 | 0 | 40 | @ | 50 | P | 60 | ` | 70 | p |
21 | ! | 31 | 1 | 41 | A | 51 | Q | 61 | a | 71 | q |
22 | " | 32 | 2 | 42 | B | 52 | R | 62 | b | 72 | r |
23 | # | 33 | 3 | 43 | C | 53 | S | 63 | c | 73 | s |
24 | $ | 34 | 4 | 44 | D | 54 | T | 64 | d | 74 | t |
25 | % | 35 | 5 | 45 | E | 55 | U | 65 | e | 75 | u |
26 | & | 36 | 6 | 46 | F | 56 | V | 66 | f | 76 | v |
27 | ' | 37 | 7 | 47 | G | 57 | W | 67 | g | 77 | w |
28 | ( | 38 | 8 | 48 | H | 58 | X | 68 | h | 78 | x |
29 | ) | 39 | 9 | 49 | I | 59 | Y | 69 | i | 79 | y |
2A | * | 3A | : | 4A | J | 5A | Z | 6A | j | 7A | z |
2B | + | 3B | ; | 4B | K | 5B | [ | 6B | k | 7B | { |
2C | , | 3C | < | 4C | L | 5C | \ | 6C | l | 7C | \| |
2D | - | 3D | = | 4D | M | 5D | ] | 6D | m | 7D | } |
2E | . | 3E | > | 4E | N | 5E | ^ | 6E | n | 7E | ~ |
2F | / | 3F | ? | 4F | O | 5F | _ | 6F | o | | |
A few observations about this table:
- 0x20 translates to a space
- Digits are really easy to translate since 0–9 correspond to 0x30–0x39
- You can convert letters from upper case to lower case and vice versa by adding or subtracting 0x20
- There are a bunch of missing byte values: 0x00–0x1F and 0x7F–0xFF. Some of these (0x80–0xFF) are missing because ASCII only uses the lowest 7 bits of each byte, so only the first 2^7 = 128 byte values can be used. The other missing values are “unprintable ASCII”. They include characters representing line breaks and indentation, as well as “control characters” that can have special meaning to the program using the ASCII text
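The observations above can be checked with a few lines of Python. This is only a minimal sketch: the sample bytes and the helper names (`digit_value`, `to_lower`, `to_upper`) are made up for illustration.

```python
# Sample bytes chosen for illustration: the ASCII encoding of "Hi 42!"
data = bytes([0x48, 0x69, 0x20, 0x34, 0x32, 0x21])

# In ASCII, each byte maps to exactly one character.
text = data.decode("ascii")
print(text)  # Hi 42!

def digit_value(byte):
    # Digits 0-9 sit at 0x30-0x39, so subtracting 0x30 gives the number.
    return byte - 0x30

def to_lower(byte):
    # Upper case letters (0x41-0x5A) become lower case by adding 0x20.
    return byte + 0x20 if 0x41 <= byte <= 0x5A else byte

def to_upper(byte):
    # Lower case letters (0x61-0x7A) become upper case by subtracting 0x20.
    return byte - 0x20 if 0x61 <= byte <= 0x7A else byte

print(digit_value(0x37))    # 7
print(chr(to_lower(0x41)))  # a
print(chr(to_upper(0x7A)))  # Z
```

The range checks in `to_lower` and `to_upper` matter: adding 0x20 to a byte that is not a letter would turn it into an unrelated character.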
You might recall from the chapter on data formats that one goal of a format is to identify the type of data to the computer. ASCII is such a simple format and it is understood so widely that it doesn’t bother with such things. Instead it is common for a computer to simply scan the bytes of data and, if they all fall within the ASCII range (less than 0x80), assume that the data are ASCII.
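That scan is simple enough to sketch in Python. The function name `looks_like_ascii` is hypothetical, and real programs often use more elaborate detection than this.

```python
def looks_like_ascii(data: bytes) -> bool:
    """Return True if every byte is in the ASCII range (less than 0x80)."""
    return all(byte < 0x80 for byte in data)

print(looks_like_ascii(b"plain text"))        # True
print(looks_like_ascii(bytes([0x48, 0xE9])))  # False: 0xE9 is outside the ASCII range
```

Python 3.7 and later offer the same check as the built-in method `bytes.isascii()`.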
Those are the basics of ASCII. Again, you don’t need to worry about the details of ASCII as we move on. I just wanted to give you an idea of how computers handle all of the text we’ll be seeing later on.
HTML
Now that we know how computers read text, this opens up a world of text formats. Just as a data format is an agreement on the meaning of binary data, a text format is an agreement on the meaning of text (which itself might be stored in a binary data format like ASCII).
HyperText Markup Language (HTML) is one such text format. The purpose of HTML is to enrich plain text with additional meaning. For example, consider this text:
The rare original heartsblood goes,
Spends in the earthen hide, in the folds and wizenings, flows
In the gutters of the banked and staring eyes. He lies
As still as if he would return to stone,
Richard Wilbur, The Death of a Toad
From the context you can probably tell that this is a quotation, but computers aren’t so good at guessing such things. They like to have things all spelled out. Let’s mark up this text with HTML to make the meaning explicit.
I have added special text colors and styles to these sections to make the HTML easier to read.
It’s easy to spot the HTML parts because they are all wrapped in angled brackets <like this>. These bracketed bits are called “tags”. The tags we’ve added are the bare minimum to identify this as an HTML document. Let’s examine each tag’s meaning.
The !DOCTYPE tag at the top lets the computer know that this is an HTML document. Next is an <html> tag. You will notice another similar tag at the bottom: </html>. The / at the beginning of the tag tells us that these two tags are a pair. This means that everything between <html> and </html> is HTML. These two tags always wrap the contents of an HTML document. Next we see another pair of tags: <body> and </body>. These tags enclose the body of our text.
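To see the pairing of tags from the computer's point of view, here is a small Python sketch using the standard library's `html.parser` module. The document string is my reconstruction of a bare-minimum document as described above, with the quotation shortened, and the `TagLister` class is a hypothetical helper, not part of any library.

```python
from html.parser import HTMLParser

# Assumed reconstruction of a bare-minimum HTML document (quotation shortened).
MINIMAL_DOCUMENT = """<!DOCTYPE html>
<html>
<body>
In the gutters of the banked and staring eyes. He lies
As still as if he would return to stone,

Richard Wilbur, The Death of a Toad
</body>
</html>
"""

class TagLister(HTMLParser):
    """Records every declaration, start tag, and end tag in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_decl(self, decl):
        # Called for <!DOCTYPE ...>; decl is the text inside the brackets.
        self.events.append("<!" + decl + ">")

    def handle_starttag(self, tag, attrs):
        self.events.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.events.append("</" + tag + ">")

lister = TagLister()
lister.feed(MINIMAL_DOCUMENT)
lister.close()
print(lister.events)
# ['<!DOCTYPE html>', '<html>', '<body>', '</body>', '</html>']
```

Notice that the tags close in the reverse of the order in which they opened, so each pair sits neatly inside the pair before it.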
As I said, this is just the bare minimum. Let’s add interesting stuff.
Here we’ve identified the stanza of the poem as a paragraph using <p></p> tags and we’ve added <br/> tags at the end of each line to indicate line breaks. The / at the end of the br tag indicates that each tag is on its own and doesn’t have a matching </br> later in the document. This teaches us an important lesson about HTML.
HyperText Markup Language (HTML) is a language for describing the structure and meaning of text, with no regard to its appearance or presentation.
We humans understand the difference in meaning between line breaks in a paragraph and line breaks in a poem. We understand from context how the name following a quoted paragraph is not part of the quotation itself but a citation. HTML needs all of these implicit meanings to be made clear: line breaks are assumed to be meaningless unless specified with tags; text is assumed to be grouped together unless separated by tags.
A side effect of HTML being very explicit and ignoring line breaks and indentation is that we can use these tools to try to make HTML a little more readable. Notice how I use indentation to make it clearer where tags start and end.
HTML also provides tags for marking up quotations:
Now the association between the quotation and citation is clear.
You might have wondered earlier at the point of the <body> tag. What isn’t part of the body of text? Well, HTML provides another tag <head> in which you can place information about the document that isn’t part of the document itself.
Now this is looking like a proper HTML document. But there’s one notable HTML tag which is missing:
We have added the mighty anchor tag or, as you probably know it, a hyperlink. This tag looks a little different because it includes an attribute. An attribute goes inside a start tag after the tag name and usually looks something like key="value". Attributes let us describe additional information about a particular instance of a tag.
In our case, the anchor “Richard Wilbur” has a “hypertext reference” (href) to a Wikipedia article.
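As a rough illustration of how a program reads attributes, the Python sketch below pulls the href out of an anchor tag. The markup string is hypothetical: the exact URL is my assumption, and `LinkCollector` is a made-up helper built on the standard `html.parser` module.

```python
from html.parser import HTMLParser

# Hypothetical anchor markup; the URL is an assumption for illustration.
CITATION = '<a href="https://en.wikipedia.org/wiki/Richard_Wilbur">Richard Wilbur</a>'

class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the anchor we are currently inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (key, value) pairs, one per attribute.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None

collector = LinkCollector()
collector.feed(CITATION)
collector.close()
print(collector.links)
```

The key="value" shape of an attribute shows up directly in the parser: each attribute is handed over as a (key, value) pair.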
Behind the Scenes
We’ve created a wonderful HTML document, but now what can we do with it? Well, the real magic of HTML occurs when you give an HTML document to a web browser (like the one you’re using right now). The browser reads the document, interprets the various tags, and turns it into an interactive web page for you to browse. Check out the document we just made by clicking here.
Pretty cool, huh? To prove that there’s no trickery going on here, try right-clicking on that page. The menu that pops up should have an option like “View page source” (this option may be difficult to find on a mobile device). This shows you exactly what HTML your browser is interpreting to create the page.
“Interpreting” is definitely the correct word to use here. In most web browsers, you will probably see “Richard Wilbur” underlined and colored in blue and the whole citation typeset in italics. But nowhere in our HTML does it say “make this blue and underlined”! Your browser has styled the HTML according to its interpretation in order to pass along the meaning of the tags to you.
One advantage of HTML is that it allows for alternative interpretations. For example, a blind person might use a web browser that interprets HTML into a medium of touch and sound so that they can still interact with it.
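To make the idea of alternative interpretations concrete, here is a toy interpreter in Python that turns marked-up text into plain text instead of a styled page. It is only a sketch under simple assumptions: the `STANZA` markup is my own shortened example, `PlainTextRenderer` is a hypothetical helper, and it understands nothing beyond the br tag.

```python
from html.parser import HTMLParser

# Shortened example markup; the indentation and line breaks in the source
# are there for human readers and carry no meaning in HTML.
STANZA = """<p>
  In the gutters of the banked and staring eyes. He lies<br/>
  As still as if he would return to stone,<br/>
</p>"""

class PlainTextRenderer(HTMLParser):
    """Interprets HTML as plain text: only <br/> starts a new line."""

    def __init__(self):
        super().__init__()
        self.lines = [[]]  # each line is a list of words

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <br/>.
        if tag == "br":
            self.lines.append([])

    def handle_data(self, data):
        # Splitting on whitespace discards source line breaks and indentation.
        self.lines[-1].extend(data.split())

    def render(self):
        return "\n".join(" ".join(words) for words in self.lines if words)

renderer = PlainTextRenderer()
renderer.feed(STANZA)
renderer.close()
plain_text = renderer.render()
print(plain_text)
```

The same markup could just as well be fed to a program that reads the lines aloud, which is the spirit of the screen reader example above.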
Exercises
- Translate the following bytes into text using the ASCII format:
  55 73 69 6E 67 20 41 53 43 49 49 20 69 73 20 65 61 73 79 21
- View the page source of this page; it’s written in HTML. Try to match up the HTML tags with what you actually see in your browser.
  Well… I don’t actually write this book in HTML. I write it in a different textual language called Markdown. A program then turns my Markdown text into HTML. This is what that looks like.
- Here is one of my favorite recipes written in plain text. Give it an HTML treatment like we did to the quotation above.
    Pesto
    =====
    This family recipe is simple, yet I have rarely tasted a restaurant's pesto that bested it.
    Ingredients
    -----------
    * 2/3 cup basil leaves (approx.)
    * 1/3 cup olive oil
    * 1/3 cup parmesan cheese
    * 2 tbsp pine nuts or walnuts
    * 1/8 tsp (white) pepper
    * 2-3 cloves garlic
    Directions
    ----------
    1. Put all ingredients in blender
    2. Blend
    Serves 4-6
  There are a lot more tags to choose from than the few I showed you. Check out the full list of HTML tags here. Note that you don’t have to keep all of the text from the original if it doesn’t seem important to the meaning of the document.