Part 2: The Internet
Almost everyone uses the Internet these days: at home, at work, and even on the go with their cell phones. Consequently, most people have a vague idea of what the Internet is. But I think that many people would be surprised at how far from reality their vague ideas are.
For that reason, we are going to take a similar approach as we did for Part 1 and start by digging down deep to understand the fundamentals before building up to a complete picture. Let’s do it!
You probably are aware that the Internet lets your computer communicate with other computers around the world. Let’s forget about that complicated “around the world” part and focus just on “computers communicating”.
Protocols
We will first consider the simplest possible example of inter-computer communication: two computers connected with a cable. Each computer can send data across the cable using electrical signals and each computer can interpret electrical signals from the cable to get the data out again. Don’t worry about the technical details here, all this means is that each computer can send bytes to the other one.
But now we’re facing a problem similar to one we’ve faced before. If one computer receives a bunch of bytes from the other, how does it know what those bytes mean? This is just like the issue we faced when we were learning about binary data: bytes can mean very different things depending on how they are interpreted. The solution there was to agree on different formats for different data, but this can prove problematic for inter-computer communication.
Suppose one computer sends ten bytes to the other. Is that one ten byte message or two five byte messages? Normally the receiving computer would use the data format to determine this, but what if it doesn’t know what format we’re using?
The current list of digital data formats is huge and new formats are being invented all the time. It is quite common for one computer to understand a different set of formats than another one. If we simply sent formatted data across a wire to another computer that doesn’t know the format, it would have no idea how to read the data it receives.
Another problem that occurs with inter-computer communication is unreliable connections. For intra-computer communication—like transferring data from the hard drive to the RAM—we usually assume that the connection is perfect and the data will get to their destination intact. On the other hand, when data are being transmitted to other computers via cables or wireless signals, problems often occur. Cables can break and signals can get garbled. Even if the receiving computer understands the data format, it needs to know when a problem is occurring with the connection so that at the very least it knows that the data it gets might not be correct. Ideally when a problem occurs we would like to be able to try again until the data make it across successfully.
To solve these problems, we agree on certain rules for computer communications. This agreement has a special term:
A protocol defines the rules for exchanging messages between computers, including the data formats that will be used.
To get a handle on protocols, we’re going to build a protocol bit-by-bit. The purpose of the protocol will be to allow one computer to send another one some data while overcoming the various challenges of inter-computer communication:
- The receiving computer must understand what information is being sent
- The receiving computer must know when the information it received is incomplete
- The receiving computer must know when the information it received is incorrect
- It should be possible to send information even when the connection is unreliable
I don’t want to get bogged down in technical details, so I will frequently switch to analogy during this section. Let’s imagine the sending and receiving computers as two people: Alice and Bob, respectively. We can imagine the information as a copy of book that Alice wishes to send Bob.
Packaging
In our analogy, Alice would probably give the book to the postal service to deliver to Bob. This leads to the following rephrasing of our first challenge:
How can Alice give the book to the postal service?
If she simply puts the book in a mailbox, it will almost certainly be lost. She needs to provide information about who the book is for and where they live. But she can’t simply write this information on the book; how would the postal service know where to look for it? They might get confused and think that Bob is the name of publisher and end up mailing it to Stephen King.
The solution is to use a package. Let’s consider how this solves the problem:
- A package provides standard addressing. The postal service knows how to get the information it needs
- The package removes the need to know what exactly Alice is sending. If it is possible to send Bob a package, then it is also possible to send Bob anything that fits in a package—be it a book, photograph, CD, etc.
The computer equivalent of package provides these same benefits.
A packet is a data format that wraps other data formats. It provides information such as the amount of data being sent and where it is being sent to. A protocol must specify what packet format is used to transmit data.
Remember that with computers, Bob needs to know how to read any information that Alice sends him. By wrapping the data being sent in a single, standardized packet format, Alice can send Bob any kind of information at all using a single format. For example, a packet format might look like this:
- 4 bytes indicating the number of bytes in the next section:
- N bytes of data
Then Alice and Bob would use this format for all information exchanged between them. If Alice wants to send Bob a picture, she first adds these 4 bytes in front of the image format. Then Bob can receive that information without knowing anything about the image format Alice used—he gets everything he needs from the first 4 bytes.
Of course we probably want to have something like an address and return address in the packet format, so that the postal service knows where to deliver the package. But right now we’re considering the case of two computers connected only to each other, so we’ll ignore this for the time being.
From now on, whenever we talk about sending data between computers, we will assume that the data are enclosed in a packet.
ACK
Now Alice has packaged up her book and sent it to Bob. What is the next challenge? Well, if our postal service can make mistakes, it’s this:
How does Alice know that Bob received the package?
In our analogy, we’re assuming that the postal service is the only method for Alice and Bob to communicate (because this represents the connection between the computers). In other words, how can we use an unreliable connection (the postal service) to send something reliably?
Here is where the protocol really comes into play. Alice and Bob can agree ahead of time that if Bob receives a package from Alice, he will immediately reply with a package of his own telling Alice that he got the package. Now each person has a role to play:
- Alice sends Bob a copy of the book
- Every time Bob receives a package from Alice, he replies with a package of his own as confirmation
- If a week goes by without Alice getting the confirmation package from Bob, she sends him another copy of the book
Once Alice receives a confirmation package from Bob, she knows for sure that he received it. Of course, if the postal service is really crappy and frequently loses packages, this exchange could take quite a while—but it’s the only way to be sure. And if the postal service is slow and takes over a week to deliver packages, Alice might end up sending Bob several copies even though he’s receiving them without issue. But if Alice really wants Bob to get that book, it’s worth the extra effort.
With computers, these resends can occur several times per second and the messages are all sent and received at the speed of light. So even with a terrible connection, the delay can often be hard to notice. The standard terminology for a confirmation from Bob is an “acknowledgment” or “ACK” for short.
Checksums
With computers it is possible for a transmission to become garbled in transit—for bits in the data to be randomly changed. With our analogy, that would kind of be like the postal service accidentally opening the package, accidentally tearing out or changing some pages of the book, accidentally sealing it up again, and delivering it to Bob. Worst postal service ever, right?
How does Alice know that Bob received the correct copy of her book?
The protocol from the previous version doesn’t help here. If Bob receives a garbled version of the book, he will reply with an ACK and Alice will be none the wiser.
The solution here is that Alice can provide some additional information with the package. Suppose that Alice carefully counts the number of times the letter ‘e’ appears in her book and writes this number on the package. Then when Bob receives the book, he can also count the letter ‘e’ and see if it matches the number on the package. If the postal service randomly changes Alice’s book, it is unlikely that the numbers would still match.
In such a situation, Bob could send Alice an ACK along the lines of “I got your book, but those morons at the postal service accidentally screwed it up.” Alice can then immediately send Bob a new copy (and hope it doesn’t run into the same problem).
This double check would be agreed upon ahead of time as part of the protocol. The analogy isn’t very good here because the check depends on Alice sending Bob a book, but with computers we don’t have this problem. Remember that all digital information essentially amounts to numbers, so we can easily come up with a check that works for any packet—regardless of its contents. For example, simply interpret each byte of data as a non-negative integer and add them all up (ignoring overflow). Alice and Bob can both perform this check on the packet without knowing what is inside it.
A checksum is a function that takes an arbitrary number of bytes as input and outputs a fixed number of bytes as output and which is unlikely to produce the same output given two different inputs. Usually this function involves some kind of sum and is used for error checking.
The checksum value would be included in the packet format since it doesn’t depend on the kind of data being sent. Note that—like counting ‘e’s—the checksum doesn’t provide absolute certainty that the data are correct. A byte has 1 of 256 possible values, so if a message is randomly garbled there is a 1/256 (0.39%) chance that the checksum will still match. Those odds seem pretty good, but real computer communications usually use even more complicated checksums that have an even smaller chance of missing an error.
Again, if the connection is really bad, this can involve lots and lots of resends, but it’s still worth it. You would be surprised at how much damage a single erroneous bit can cause to a computer.
Disassembly
When it comes to errors in transmission between computers, it is uncommon that each and every byte transmitted has the same probability of being lost or garbled. Usually connections fluctuate between periods of high reliability and low reliability. For example a connection might be mostly reliable but occasionally lapse into unreliability for just a few seconds of each minute.
The analogy kind of breaks down here, so just think about how Alice’s book—formatted as bytes and wrapped in a packet—is transmitted to Bob. Each bit of the packet must be transmitted over the cable one at a time. The more bytes there are, the longer this process will take.
Suppose now that this connection is unreliable in a way that we expect around 2 seconds of each minute to be in a period of low reliability i.e. bytes sent during those 2 seconds have a high chance of being lost or garbled. If the book’s packet is so large that it takes a minute to transmit, we’re in serious trouble. Why? Since it takes a minute to transmit, we would expect that some of the bytes were sent during the 2 second period of unreliability and thus had a high chance of being lost or garbled. Since Bob performs the checksum on the entire packet, we would expect the checksum to almost always fail—unless we got really lucky and nothing bad happened during the period of high unreliability.
With our current protocol, we would need to resend the whole packet many, many times. Furthermore, these resends would be extremely wasteful since we would expect only 2 bytes of every 60 to be bad.
How can Alice decrease redundancy when retransmitting large packets?
What if Alice sends Bob the book one page at a time—each with its own checksum in its own packet? This provides some nice advantages:
- Pages sent during periods of high reliability are unaffected by pages sent during periods of low reliability
- Even if some pages don’t make it, Bob can start reading the book before he’s gotten the whole thing
- Once Bob receives a page with a valid checksum, Alice never has to resend that page
The protocol that Alice and Bob agree on would have to be a little more complex, however.
- Alice sends each page in its own packet
- Whenever Bob receives a package with a valid checksum, he sends Alice an ACK saying he got the page
- Alice must keep track of which pages Bob has ACKed, so she can resend the un-ACKed pages each week
- As he receives pages, Bob must reassemble them into the complete book using page numbers
However, this extra work is definitely worth it. Going back to the previous example where 2 seconds out of 60 are unreliable, we would expect Bob to receive the entire book ungarbled in less than two minutes.
While splitting up a book by its pages works only for books, digital data can be split up regardless of the data being sent. For example, we might use the following format for our packets:
- 1 byte for the packet number
- 1 byte for the checksum
- 14 bytes of data
Notice that we no longer specify the size of the packet—every packet is fixed at 16 bytes. How do we adapt our protocol to use this format? It requires a bit of preparation when Alice wants to send something:
- Alice counts the number of bytes of data being sent
- She then splits up the data into 14-byte chunks. If the size of the data isn’t divisible by 14, she can add 00 bytes to the end to make it so.
- Each 14-byte chunk of data is numbered in order starting from 1 and put in its own packet
- The number of bytes of the data (as originally recorded) is stored in the data for packet 0.
Then once Bob receives packet 0 he knows the total number of data bytes being sent to him. This tells him two things: the total number of packets he expects to receive (bytes ÷ 14 rounded up), and the number of 00 bytes that Alice added on to the end of the last packet (remainder of bytes ÷ 14). He then has everything he needs to reassemble the data once he gets all of the packets.
Exercises
- Both example packet formats in this chapter (in the Packaging and Disassembly sections) impose restrictions on the total number of bytes of data that can be sent. What are these limits?
- Consider a new situation where Alice and Bob want to have a conversation. Adapt the example protocol (or design a new one) which will allow them to do so. What new challenges does this situation pose?