「MIT 6.828」MIT 6.828 Fall 2018 lab6 PartB

Lab 6 finally reaches its last part

Posted by 许大仙 on November 2, 2022

This is the last part of Lab 6, the final lab. Time to wrap things up and celebrate.

Part B: Receiving packets and the web server

Receiving Packets

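Just as we did for transmitting packets, we must configure the E1000 to receive packets and provide it with a receive descriptor queue and receive descriptors.

Section 3.2 describes how packet reception works, including the receive queue structure and the receive descriptors. Section 14.4 covers the initialization details.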

Exercise 9. Read section 3.2. You can ignore anything about interrupts and checksum offloading (you can return to these sections if you decide to use these features later), and you don’t have to be concerned with the details of thresholds and how the card’s internal caches work.

The purpose of this exercise is to get familiar with Sections 3.2 and 14.4 of the manual.

Reading – Packet Reception

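In the general case, packet reception consists of recognizing the presence of a packet on the wire, performing address filtering, storing the packet in the receive data FIFO, transferring the data to a receive buffer in host memory, and updating the state of a receive descriptor.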

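  • Packet Address Filtering: the hardware stores incoming packets in host memory subject to the following filter modes. If there is insufficient space in the receive FIFO, the hardware drops the packets and indicates the missed packets in the appropriate statistics registers. The filter modes the E1000 supports are:

  • Exact Unicast/Multicast: the destination address must exactly match one of 16 stored addresses (these may be unicast or multicast addresses)
  • Promiscuous Unicast: receive all unicast packets
  • Multicast: the upper bits of the incoming packet's destination address index a bit vector that indicates whether to accept the packet. If the bit in the vector is 1, the packet is accepted; otherwise it is rejected. The controller provides a 4096-bit vector. Software can choose among four sets of bits for indexing: bits [47:36], [46:35], [45:34], or [43:32] of the internally stored representation of the destination address.
  • Promiscuous Multicast: receive all multicast packets
  • VLAN: receive all VLAN packets that have the appropriate bit set in the VLAN filter table. Section 9.3 explains this in detail.
  • Normally, only good packets are received, that is, packets with no CRC error, symbol error, sequence error, length error, or alignment error, and with no carrier extension or receive error detected. However, if the store-bad-packet bit (RCTL.SBP) is set in the Receive Control register, bad packets that pass the filter function are also stored in host memory. Packet errors are indicated by the error bits in the receive descriptor (RDESC.ERRORS). To receive all packets, good or bad, set the promiscuous enable bits (RCTL.UPE/MPE) and the store-bad-packet bit (RCTL.SBP).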

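  • Receive Data Storage: memory buffers pointed to by descriptors store the packet data. The hardware supports seven receive buffer sizes:

  • 256, 512, 1024, 2048, 4096, 8192, and 16384 bytes
  • The buffer size is selected through RCTL.BSIZE and RCTL.BSEX in the Receive Control register (see Section 13.4.22 for details).
  • The Ethernet controller places no alignment restrictions on packet buffer addresses. This is desirable in situations where the receive buffer was allocated by higher layers in the networking software stack, since those layers may not know the buffer alignment requirements of a specific Ethernet controller. Although alignment is completely unrestricted, it is highly recommended that software allocate receive buffers on at least cache-line boundaries whenever possible.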

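  • Receive Descriptor Format: a receive descriptor is a data structure that contains the receive data buffer address and fields for hardware to store packet information. The shaded areas in Table 3-1 mark the fields that hardware modifies when a packet is received.

    • [Table 3-1: receive descriptor layout]
    • Upon receipt of a packet, hardware stores the packet data into the indicated buffer and writes the length, Packet Checksum, status, and errors fields. The length covers the data written to the receive buffer, including CRC bytes (if any). Software must read multiple descriptors to determine the complete length of a packet that spans multiple receive buffers.
    • For standard 802.3 packets (non-VLAN), the packet checksum is by default computed over the entire packet, from the first byte of the DA through the last byte of the CRC, including the Ethernet and IP headers. Software may modify the starting offset for the packet checksum calculation through the Receive Control Register, which is described in Section 13.4.22. To verify the TCP checksum using the Packet Checksum, software must adjust the Packet Checksum value to back out the bytes that are not part of the true TCP checksum.
    • The Status field is detailed in Section 3.2.3.1:
      • [Figure: receive status field bits]
  • Receive Descriptor Fetching: the descriptor fetching strategy is designed to support large bursts across the PCI bus. This is made possible by 64 on-chip receive descriptors and an optimized fetching algorithm. The fetching algorithm attempts to make the best use of PCI bandwidth by fetching a cache line (or more) of descriptors with each burst. See Section 3.2.4.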

    • When the on-chip buffer is nearly empty (RXDCTL.PTHRESH), a prefetch is performed whenever enough valid descriptors (RXDCTL.HTHRESH) are available in host memory and no other PCI activity of greater priority is pending (descriptor fetches and write-backs or packet data transfers).
    • [Figure: receive descriptor fetching algorithm]
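  • Receive Descriptor Write-Back: processors have cache line sizes that are larger than the receive descriptor size (16 bytes). Consequently, writing back descriptor information for each received packet would cause expensive partial cache line updates. Two mechanisms minimize the occurrence of partial line write-backs:

    • Receive descriptor packing (Section 3.2.5.1): descriptors are packed into cache-line units and written back together when certain conditions trigger (such as a timeout or set write fresh bit)
    • Null descriptor padding (Section 3.2.5.2)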
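  • Receive Descriptor Queue Structure: Figure 3-2 shows the structure of the receive descriptor ring. Hardware maintains a circular ring of descriptors and writes back used descriptors just prior to advancing the head pointer. Head and tail pointers wrap back to base when "size" descriptors have been processed.

    • [Figure 3-2: receive descriptor ring structure]

    • Hardware owns all descriptors between head and tail.

    • Software adds receive descriptors by writing the tail pointer with the index of the entry beyond the last valid descriptor. As packets arrive, they are stored in memory and the head pointer is incremented by hardware. When the head pointer is equal to the tail pointer, the ring is empty. Hardware stops storing packets in system memory until software advances the tail pointer, making more receive buffers available.

    • The receive descriptor head and tail pointers reference 16-byte blocks of memory. Shaded boxes in the figure represent descriptors that have stored incoming packets but have not yet been recognized by software. Software can determine whether a receive buffer is valid by reading the descriptor in memory rather than by I/O reads. Any descriptor with a non-zero status byte has been processed by the hardware and is ready to be handled by the software.

    • For the registers that describe the receive descriptor ring, see page 27.

      [Figures: register descriptions for the receive descriptor ring]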

  • What remains is receive initialization:
    • Program the Receive Address Register(s) (RAL/RAH) with the desired Ethernet addresses. RAL[0]/RAH[0] should always be used to store the Individual Ethernet MAC address of the Ethernet controller.
    • Initialize the MTA (Multicast Table Array) to 0b
    • Set the Receive Descriptor Length (RDLEN) register to the size (in bytes) of the descriptor ring. This register must be 128-byte aligned.
    • The Receive Descriptor Head and Tail(RDH/RDT) registers are initialized (by hardware) to 0b after a power-on or a software-initiated Ethernet controller reset.
    • Program the Receive Control (RCTL) register with appropriate values for desired operation. (There are many fields; go through them one at a time during initialization. They include RCTL.EN, RCTL.LPE, and others.)
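
The RDLEN step above can be checked with a little arithmetic. The following is a minimal sketch, assuming the 16-byte descriptor size quoted earlier; the helper name `rdlen_for` is hypothetical and not part of the lab code. It shows why a descriptor count that is a multiple of 8 (such as 128) satisfies the 128-byte alignment rule.

```c
#include <stddef.h>
#include <stdint.h>

#define RX_DESC_BYTES 16u   /* size of one receive descriptor, per the manual */

/* Hypothetical helper: returns the value to program into RDLEN for a ring of
 * ndesc descriptors, or 0 if that ring size would not be a multiple of
 * 128 bytes (which RDLEN requires). */
static uint32_t rdlen_for(size_t ndesc)
{
    uint32_t bytes = (uint32_t)(ndesc * RX_DESC_BYTES);

    if (ndesc == 0 || bytes % 128 != 0)
        return 0;
    return bytes;
}
```

With 128 descriptors the ring occupies 2048 bytes, which is the value the driver later writes as `sizeof(e1000_rx_queue)`.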

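The receive queue is very similar to the transmit queue, except that it starts out filled with empty packet buffers waiting to be filled with incoming packets. So when the network is idle, the transmit queue is empty (because all packets have been sent), while the receive queue is full of empty packet buffers.

When the E1000 receives a packet, it first checks whether the packet matches the card's configured filters (for example, whether the packet is addressed to this E1000's MAC address) and ignores the packet if it matches none of them.

Otherwise, the E1000 tries to retrieve the next receive descriptor from the head of the receive queue. If the head (RDH) has caught up with the tail (RDT), the receive queue is out of free descriptors, so the card drops the packet. If there is a free receive descriptor, it copies the packet data into the buffer pointed to by the descriptor, sets the descriptor's DD (Descriptor Done) and EOP (End of Packet) status bits, and increments RDH.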
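
The head/tail rules can be made concrete with a small user-space model. This is only a sketch under stated assumptions: the struct and function names are invented for illustration, the ring has 8 entries instead of the lab's 128, and no real hardware registers are involved.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8   /* illustrative only; the lab uses at least 128 */

/* Software model of the two ring indices (not real registers). */
struct rx_ring {
    size_t head;   /* next descriptor the hardware will fill (RDH) */
    size_t tail;   /* one past the last descriptor software made available (RDT) */
};

/* Number of descriptors the hardware currently owns: those in [head, tail). */
static size_t ring_hw_owned(const struct rx_ring *r)
{
    return (r->tail + RING_SIZE - r->head) % RING_SIZE;
}

/* Hardware side: store one packet. Returns 0 on success, or -1 when
 * head == tail, in which case the packet is dropped. */
static int ring_hw_store(struct rx_ring *r)
{
    if (r->head == r->tail)
        return -1;
    r->head = (r->head + 1) % RING_SIZE;
    return 0;
}

/* Software side: hand one more buffer to the hardware by advancing tail. */
static void ring_sw_advance_tail(struct rx_ring *r)
{
    size_t next_tail = (r->tail + 1) % RING_SIZE;

    /* Moving tail onto head would make a full ring look empty. */
    assert(next_tail != r->head);
    r->tail = next_tail;
}
```

Starting from head = 0 and tail = RING_SIZE - 1, the model accepts seven packets, drops the eighth, and accepts again once software advances the tail by one.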

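If the E1000 receives a packet that is larger than the packet buffer in one receive descriptor, it retrieves as many descriptors as necessary from the receive queue to store the entire contents of the packet. To indicate that this has happened, it sets the DD status bit on all of these descriptors, but sets the EOP status bit only on the last of them. You can either deal with this possibility in your driver, or simply configure the card not to accept "long packets" (also known as jumbo frames) and make sure your receive buffers are large enough to store the largest possible standard Ethernet packet (1518 bytes).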
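
For a driver that does choose to handle packets spanning several descriptors, the DD/EOP rule translates into a short walk over the ring. The sketch below is an illustration only: `packet_length` is a hypothetical helper, and the descriptor struct simply mirrors the field layout used in this lab.

```c
#include <stddef.h>
#include <stdint.h>

#define RXD_STAT_DD  0x01   /* Descriptor Done */
#define RXD_STAT_EOP 0x02   /* End of Packet */

struct rx_desc {
    uint64_t addr;
    uint16_t length;
    uint16_t chksum;
    uint8_t  status;
    uint8_t  err;
    uint16_t special;
};

/* Hypothetical helper: starting at ring[start], add up descriptor lengths
 * until the descriptor marked EOP. Returns the total packet length and stores
 * the number of descriptors consumed in *used, or returns -1 if the packet is
 * not complete yet (a descriptor without DD is reached first). */
static long packet_length(const struct rx_desc *ring, size_t nring,
                          size_t start, size_t *used)
{
    long total = 0;

    for (size_t n = 0; n < nring; n++) {
        const struct rx_desc *d = &ring[(start + n) % nring];

        if (!(d->status & RXD_STAT_DD))
            return -1;
        total += d->length;
        if (d->status & RXD_STAT_EOP) {
            *used = n + 1;
            return total;
        }
    }
    return -1;
}
```

The driver in this post avoids this path by leaving long packets disabled, so every received packet fits in one descriptor.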

Exercise 10. Set up the receive queue and configure the E1000 by following the process in section 14.4. You don’t have to support “long packets” or multicast. For now, don’t configure the card to use interrupts; you can change that later if you decide to use receive interrupts. Also, configure the E1000 to strip the Ethernet CRC, since the grade script expects it to be stripped.

By default, the card will filter out all packets. You have to configure the Receive Address Registers (RAL and RAH) with the card’s own MAC address in order to accept packets addressed to that card. You can simply hard-code QEMU’s default MAC address of 52:54:00:12:34:56 (we already hard-code this in lwIP, so doing it here too doesn’t make things any worse). Be very careful with the byte order; MAC addresses are written from lowest-order byte to highest-order byte, so 52:54:00:12 are the low-order 32 bits of the MAC address and 34:56 are the high-order 16 bits.
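
The byte order described above is easy to get wrong, so here is a standalone sketch of the packing. The helper name `mac_to_ral_rah` is hypothetical; the only values taken from the lab are QEMU's default MAC address and the Address Valid bit in RAH.

```c
#include <stdint.h>

#define RAH_AV 0x80000000u   /* Address Valid bit in RAH */

/* Hypothetical helper: pack a 6-byte MAC address into the values for RAL and
 * RAH. mac[0] is the first byte as written (0x52 for 52:54:00:12:34:56) and
 * ends up in the lowest-order byte of RAL. */
static void mac_to_ral_rah(const uint8_t mac[6], uint32_t *ral, uint32_t *rah)
{
    *ral = (uint32_t)mac[0]
         | (uint32_t)mac[1] << 8
         | (uint32_t)mac[2] << 16
         | (uint32_t)mac[3] << 24;
    *rah = (uint32_t)mac[4]
         | (uint32_t)mac[5] << 8
         | RAH_AV;
}
```

For 52:54:00:12:34:56 this yields RAL = 0x12005452 and RAH = 0x80005634.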

The E1000 only supports a specific set of receive buffer sizes (given in the description of RCTL.BSIZE in 13.4.22). If you make your receive packet buffers large enough and disable long packets, you won’t have to worry about packets spanning multiple receive buffers. Also, remember that, just like for transmit, the receive queue and the packet buffers must be contiguous in physical memory.
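
As a sketch of how the buffer-size encoding works, the function below decodes an RCTL value into a byte count. It assumes the layout from Section 13.4.22 (BSIZE in bits 17:16, BSEX in bit 25); the macro and function names here are illustrative rather than taken from the lab headers.

```c
#include <stdint.h>

#define RCTL_BSIZE_SHIFT 16            /* RCTL.BSIZE occupies bits 17:16 */
#define RCTL_BSIZE_MASK  0x00030000u
#define RCTL_BSEX        0x02000000u   /* buffer size extension, bit 25 */

/* Illustrative decoder: returns the receive buffer size in bytes selected by
 * an RCTL value, or 0 for the reserved encoding (BSEX = 1, BSIZE = 00b). */
static uint32_t rctl_buffer_size(uint32_t rctl)
{
    uint32_t bsize = (rctl & RCTL_BSIZE_MASK) >> RCTL_BSIZE_SHIFT;
    uint32_t base = 2048u >> bsize;    /* 2048, 1024, 512, 256 */

    if (!(rctl & RCTL_BSEX))
        return base;
    if (bsize == 0)
        return 0;                      /* reserved when BSEX is set */
    return base * 16;                  /* 16384, 8192, 4096 */
}
```

Leaving both fields at zero selects 2048 bytes, which is large enough for a 1518-byte standard Ethernet frame.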

You should use at least 128 receive descriptors

Now for the coding. The receive side is largely similar to the transmit side.

//kern/e1000.h
//For receive
#define E1000_ICS      0x000C8  /* Interrupt Cause Set - WO */
#define E1000_IMS      0x000D0  /* Interrupt Mask Set - RW */
#define E1000_RDBAL    0x02800  /* RX Descriptor Base Address Low - RW */
#define E1000_RDBAH    0x02804  /* RX Descriptor Base Address High - RW */
#define E1000_RDLEN    0x02808  /* RX Descriptor Length - RW */
#define E1000_RDH      0x02810  /* RX Descriptor Head - RW */
#define E1000_RDT      0x02818  /* RX Descriptor Tail - RW */
#define E1000_RDTR     0x02820  /* RX Delay Timer - RW */
#define E1000_RXDCTL   0x02828  /* RX Descriptor Control queue 0 - RW */
#define E1000_RCTL     0x00100  /* RX Control - RW */
/* these buffer sizes are valid if E1000_RCTL_BSEX is 0 */
#define E1000_RCTL_SZ_2048        0x00000000    /* rx buffer size 2048 */
#define E1000_RCTL_SZ_1024        0x00010000    /* rx buffer size 1024 */
#define E1000_RCTL_SZ_512         0x00020000    /* rx buffer size 512 */
#define E1000_RCTL_SZ_256         0x00030000    /* rx buffer size 256 */
/* Receive Control */
#define E1000_RCTL_RST            0x00000001    /* Software reset */
#define E1000_RCTL_EN             0x00000002    /* enable */
#define E1000_RCTL_BAM            0x00008000    /* broadcast enable */
#define E1000_RCTL_SECRC          0x04000000    /* Strip Ethernet CRC */
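/* Receive Address */
#define E1000_RA       0x05400  /* Receive Address - RW Array */
#define E1000_RAH_AV  0x80000000        /* Receive address valid */


#define RTXDESC     128
#define TX_BUF_SIZE 1518
#define RX_BUF_SIZE 2048 /* matches the 2048-byte buffer size programmed into RCTL */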

struct e1000_rx_desc{
    uint64_t addr;  //buffer addr
    uint16_t length;
    uint16_t chksum;

    uint8_t status;
    uint8_t err;
    uint16_t special;
}__attribute__((packed));

// The descriptor ring base address must be 16-byte aligned.
static struct e1000_rx_desc e1000_rx_queue[RTXDESC] __attribute__((aligned(16)));
static char e1000_rx_buffer[RTXDESC][RX_BUF_SIZE];
static uint32_t E1000_MAC[6] = {0x52, 0x54, 0x00, 0x12, 0x34, 0x56};

Next we implement e1000_receive_init and call it from e1000_init to initialize the receive side.

//kern/e1000.c

int e1000_init(struct pci_func *pcif){
    pci_func_enable(pcif);
    bar_va = mmio_map_region(pcif->reg_base[0], pcif->reg_size[0]);
    uint32_t * status_reg = (uint32_t *)E1000REG (E1000_STATUS);
    assert(*status_reg == 0x80080783);

    e1000_transmit_init();
    
    //char *data = "transmit test\0";
    //e1000_transmit(data, strlen(data));

    e1000_receive_init(); // the only line added to the existing function
    return 0;
}

//MAC addresses are written from lowest-order byte to highest-order byte
//so 52:54:00:12 are the low-order 32 bits of the MAC address and 34:56 are the high-order 16 bits.
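void e1000_set_mac_addr(uint32_t mac[]){
    uint32_t low = 0, high = 0;

    // Bytes 0-3 go into RAL, lowest-order byte first.
    for(int i = 0; i < 4; i++){
        low |= mac[i] << (8 * i);
    }

    // Bytes 4-5 go into the low 16 bits of RAH. Shift relative to byte 4:
    // shifting a uint32_t by 32 bits or more is undefined behavior in C.
    for(int i = 4; i < 6; i++){
        high |= mac[i] << (8 * (i - 4));
    }
    (*(uint32_t *)E1000REG(E1000_RA)) = low;
    (*((uint32_t *)E1000REG(E1000_RA) + 1)) = high | E1000_RAH_AV;
}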

//for receive
static void e1000_receive_init(){
    memset(e1000_rx_queue, 0, sizeof(e1000_rx_queue));
    memset(e1000_rx_buffer, 0, sizeof(e1000_rx_buffer));

    for(size_t i = 0; i < RTXDESC; i++){
        e1000_rx_queue[i].addr = PADDR(e1000_rx_buffer[i]);
    }

    (*(uint32_t *)E1000REG(E1000_ICS)) = 0;
    (*(uint32_t *)E1000REG(E1000_IMS)) = 0;
    (*(uint32_t *)E1000REG(E1000_RDBAL)) = PADDR(e1000_rx_queue);
    (*(uint32_t *)E1000REG(E1000_RDBAH)) = 0;
    (*(uint32_t *)E1000REG(E1000_RDLEN)) = sizeof(struct e1000_rx_desc) * RTXDESC; //  or = sizeof(e1000_rx_queue)

    (*(uint32_t *)E1000REG(E1000_RDT)) = RTXDESC - 1;
    (*(uint32_t *)E1000REG(E1000_RDH)) = 0;
    
    // Program the receive address before enabling the receiver.
    e1000_set_mac_addr(E1000_MAC);

	// Receive control register
	// 1. long packets stay disabled (LPE is not set)
	// 2. 2048-byte buffers (BSIZE = 00b, the default)
	// 3. strip CRC
	// 4. broadcast enabled
    (*(uint32_t *)E1000REG(E1000_RCTL)) = E1000_RCTL_EN | E1000_RCTL_BAM | E1000_RCTL_SECRC; // | E1000_RCTL_SZ_2048
}

You can do a basic test of receive functionality now, even without writing the code to receive packets.

Run make E1000_DEBUG=TX,TXERR,RX,RXERR,RXFILTER run-net_testinput. testinput will transmit an ARP (Address Resolution Protocol) announcement packet (using your packet transmitting system call), which QEMU will automatically reply to.

Even though your driver can’t receive this reply yet, you should see a e1000: unicast match[0]: 52:54:00:12:34:56 message, indicating that a packet was received by the E1000 and matched the configured receive filter. If you see a e1000: unicast mismatch: 52:54:00:12:34:56 message instead, the E1000 filtered out the packet, which means you probably didn’t configure RAL and RAH correctly.

Make sure you got the byte ordering right and didn't forget to set the "Address Valid" bit in RAH. If you don’t get any “e1000” messages, you probably didn’t enable receive correctly.


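With receive initialization done, we can now handle the logic of receiving packets. To receive a packet, the driver has to keep track of which descriptor it expects to hold the next received packet (hint: depending on your design, there's probably already a register in the E1000 keeping track of this). As with transmit, the documentation states that the RDH register (/* RX Descriptor Head - RW */) cannot be read reliably from software, so to determine whether a packet has been delivered to this descriptor's packet buffer, we read the DD status bit in the descriptor. If DD is set, we can copy the packet data out of that descriptor's packet buffer and then tell the card that the descriptor is free by updating the queue's tail index, RDT.

If the DD bit isn't set, then no packet has been received. This is the receive-side equivalent of the transmit queue being full (there is nothing to hand to the caller yet), and there are several ways to handle it. We can simply return a "try again" error and require the caller to retry. While this approach works well for full transmit queues because that is a transient condition, it is less justifiable for empty receive queues because the receive queue may remain empty for long stretches of time, and making the user retry the whole time is wasteful. A second approach is to suspend the calling environment until there are packets in the receive queue to process (in other words, to block). This tactic is very similar to sys_ipc_recv. Just like in the IPC case, since we have only one kernel stack per CPU, as soon as we leave the kernel the state on the stack will be lost. We need to set a flag indicating that an environment has been suspended by the receive queue and record the system call arguments. The drawback of this approach is complexity: the E1000 must be instructed to generate receive interrupts, and the driver must handle them in order to resume the environment blocked waiting for a packet.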
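
The cost of the "try again" strategy can be seen in a small mock. Everything here is hypothetical scaffolding: `mock_try_receive` stands in for the receive system call, `mock_yield` for `sys_yield`, and the counters exist only so the polling overhead can be observed.

```c
#include <stddef.h>
#include <string.h>

#define E_RECEIVE_RETRY 1

/* Mock state standing in for the driver and the scheduler (hypothetical). */
static int polls_until_packet = 3;   /* queue stays empty for this many calls */
static int yield_count;              /* how often the caller gave up the CPU */

/* Mock receive call: fails with -E_RECEIVE_RETRY while the queue is empty. */
static int mock_try_receive(char *buf, size_t *len)
{
    if (polls_until_packet > 0) {
        polls_until_packet--;
        return -E_RECEIVE_RETRY;
    }
    memcpy(buf, "pkt", 3);
    *len = 3;
    return 0;
}

static void mock_yield(void)
{
    yield_count++;
}

/* Try-again strategy: poll, yielding between attempts. Returns 0 on success,
 * or -E_RECEIVE_RETRY after max_attempts polls that all found the queue empty. */
static int receive_with_retry(char *buf, size_t *len, int max_attempts)
{
    for (int i = 0; i < max_attempts; i++) {
        if (mock_try_receive(buf, len) == 0)
            return 0;
        mock_yield();
    }
    return -E_RECEIVE_RETRY;
}
```

Every empty poll costs a system call plus a yield, which is why a queue that stays empty for a long time makes the blocking design attractive despite its extra complexity.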

Exercise 11. Write a function to receive a packet from the E1000 and expose it to user space by adding a system call. Make sure you handle the receive queue being empty.

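The key to implementing receive is understanding how the hardware receives a packet: when a packet arrives, the hardware first filters it (comparing the MAC address, and so on). If the packet passes, the hardware stores it in the buffer we allocated, sets the descriptor's DD bit, and increments RDH. So when writing the receive function we can keep a static variable that indexes the next descriptor to read. Remember to handle the empty receive queue as a special case.

Now write the receive function: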

//kern/e1000.h
/* Receive Descriptor bit definitions */
#define E1000_RXD_STAT_DD       0x01    /* Descriptor Done */
#define E1000_RXD_STAT_EOP      0x02    /* End of Packet */

#define RTXDESC     128
#define TX_BUF_SIZE 1518
#define RX_BUF_SIZE 2048 /* matches the 2048-byte buffer size programmed into RCTL */
#define E_RECEIVE_RETRY 1

int e1000_transmit(void *data, size_t len); 
int e1000_receive(void *buf, size_t *len); //define

//kern/e1000.c
int e1000_receive(void *buf, size_t *len){
    // Index of the next descriptor to check. It is static, so it is
    // initialized once and keeps its value across calls.
    static size_t next = 0;
    size_t tail = (*(uint32_t *)E1000REG(E1000_RDT));
    // DD clear: the hardware has not stored a packet here, so the queue is empty.
    if(!(e1000_rx_queue[next].status & E1000_RXD_STAT_DD)) {
        return -E_RECEIVE_RETRY;
    }
    *len = e1000_rx_queue[next].length;
    memcpy(buf, e1000_rx_buffer[next], *len);

    // Clear the DD status bit so this descriptor reads as free on the next lap.
    e1000_rx_queue[next].status &= ~E1000_RXD_STAT_DD;
    next = (next + 1) % RTXDESC;
    // Advance the tail to hand the descriptor back to the card.
    (*(uint32_t *)E1000REG(E1000_RDT)) = (tail + 1) % RTXDESC;
    return 0;
}

And add the corresponding system call:

//inc/lib.h	
/* …… */
int	sys_ipc_recv(void *rcv_pg);
unsigned int sys_time_msec(void);
int sys_packet_try_send(void *data, size_t len);
int sys_packet_try_receive(void *addr, size_t *len);
//inc/syscall.h
enum {
    /* …… */
	SYS_time_msec,
	SYS_packet_try_send,
	SYS_packet_try_receive,
	NSYSCALLS
};
//kern/syscall.c
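static int sys_packet_try_receive(void *addr, uint32_t * len){
	// Both pointers come from user space, so check that the caller may write to
	// them before the driver copies a packet (at most RX_BUF_SIZE bytes) out.
	user_mem_assert(curenv, addr, RX_BUF_SIZE, PTE_U | PTE_W);
	user_mem_assert(curenv, len, sizeof(*len), PTE_U | PTE_W);
	return e1000_receive(addr, len);
}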
// Dispatches to the correct kernel function, passing the arguments.
int32_t
syscall(uint32_t syscallno, uint32_t a1, uint32_t a2, uint32_t a3, uint32_t a4, uint32_t a5)
{
	switch (syscallno)
	{
        /* …… */
        case SYS_packet_try_receive:
            return sys_packet_try_receive((void *)a1, (uint32_t *)a2);
    }
}

//lib/syscall.c
int sys_packet_try_receive(void * addr, size_t *len){
	return syscall(SYS_packet_try_receive, 1, (uint32_t)addr, (uint32_t)len, 0, 0, 0);
}

Challenge! If the transmit queue is full or the receive queue is empty, the environment and your driver may spend a significant amount of CPU cycles polling, waiting for a descriptor. The E1000 can generate an interrupt once it is finished with a transmit or receive descriptor (that is, notification on completion instead of polling), avoiding the need for polling. Modify your driver so that processing both the transmit and receive queues is interrupt driven instead of polling.

Note that, once an interrupt is asserted, it will remain asserted until the driver clears the interrupt. In your interrupt handler make sure to clear the interrupt as soon as you handle it. If you don’t, after returning from your interrupt handler, the CPU will jump back into it again. In addition to clearing the interrupts on the E1000 card, interrupts also need to be cleared on the LAPIC. Use lapic_eoi to do so.

Receiving Packets: Network Server

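In the network server input environment, we use the new receive system call, sys_packet_try_receive, to receive packets and pass them to the core network server environment using the NSREQ_INPUT IPC message. These IPC input messages should have a page attached with a union Nsipc whose struct jif_pkt pkt field is filled in with the packet received from the network.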

Network server architecture

Exercise 12. Implement net/input.c.

Now let's implement it:

#include "ns.h"
#include "kern/e1000.h"
#include "inc/lib.h"

extern union Nsipc nsipcbuf;

void sleep(int msec){
	unsigned now = sys_time_msec();
	unsigned end = now + msec;
	if((int)now < 0 && (int)now > -MAXERROR){
		panic("sys_time_msec: %e", (int)now);
	}
	while(sys_time_msec() < end)
		sys_yield();
}

void
input(envid_t ns_envid)
{
	binaryname = "ns_input";

	// LAB 6: Your code here:
	// 	- read a packet from the device driver
	//	- send it to the network server
	// Hint: When you IPC a page to the network server, it will be
	// reading from it for a while, so don't immediately receive
	// another packet in to the same physical page.
	size_t len;
	char rev_buf[2048];
	while (1)
	{
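		if(sys_packet_try_receive(rev_buf, &len) < 0){
			sys_yield();	// receive queue is empty: give up the CPU instead of spinning
			continue;
		}
		memcpy(nsipcbuf.pkt.jp_data, rev_buf, len);
		nsipcbuf.pkt.jp_len = len;
		ipc_send(ns_envid, NSREQ_INPUT, &nsipcbuf, PTE_P | PTE_U | PTE_W);
		sleep(50);	// Per the hint above: the network server keeps reading from the page for a while, so wait before receiving the next packet into it.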
	}
	
}
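
The fixed 50 ms sleep is one way to follow the hint. Another option, sketched here purely as an assumption and not as what this post's code does, is to rotate through a small pool of page-sized buffers so that a page handed to the network server is not overwritten by the very next packet. The pool size and the helper `stage_packet` are hypothetical, and a real JOS version would still need to be sure the server has finished with a page before reusing it.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096
#define NBUF 4   /* hypothetical pool size; a guess, not a guarantee */

/* Page-aligned staging buffers, one per in-flight packet. */
static char pool[NBUF][PAGE_BYTES] __attribute__((aligned(PAGE_BYTES)));
static size_t next_buf;

/* Copy a packet into the next buffer of the pool and return that buffer, or
 * NULL if the packet does not fit in one page. */
static char *stage_packet(const char *pkt, size_t len)
{
    char *page;

    if (len > PAGE_BYTES)
        return NULL;
    page = pool[next_buf];
    next_buf = (next_buf + 1) % NBUF;
    memcpy(page, pkt, len);
    return page;
}
```

Consecutive packets land in different pages, and a page is only reused after NBUF - 1 later packets have been staged.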

Run testinput again with make E1000_DEBUG=TX,TXERR,RX,RXERR,RXFILTER run-net_testinput. You should see

Sending ARP announcement...
Waiting for packets...
e1000: index 0: 0x26dea0 : 900002a 0
e1000: unicast match[0]: 52:54:00:12:34:56
input: 0000   5254 0012 3456 5255  0a00 0202 0806 0001
input: 0010   0800 0604 0002 5255  0a00 0202 0a00 0202
input: 0020   5254 0012 3456 0a00  020f 0000 0000 0000
input: 0030   0000 0000 0000 0000  0000 0000 0000 0000

The lines beginning with input: are a hexdump of QEMU's ARP reply.

Your code should pass the testinput tests of make grade. Note that there's no way to test packet receiving without sending at least one ARP packet to inform QEMU of JOS' IP address, so bugs in your transmitting code can cause this test to fail.

To more thoroughly test your networking code, we have provided a daemon called echosrv that sets up an echo server running on port 7 that will echo back anything sent over a TCP connection. Use make E1000_DEBUG=TX,TXERR,RX,RXERR,RXFILTER run-echosrv to start the echo server in one terminal and make nc-7 in another to connect to it. Every line you type should be echoed back by the server. Every time the emulated E1000 receives a packet, QEMU should print something like the following to the console:

e1000: unicast match[0]: 52:54:00:12:34:56
e1000: index 2: 0x26ea7c : 9000036 0
e1000: index 3: 0x26f06a : 9000039 0
e1000: unicast match[0]: 52:54:00:12:34:56

At this point, you should also be able to pass the echosrv test.


  • Challenge! Read about the EEPROM in the developer's manual and write the code to load the E1000's MAC address out of the EEPROM. Currently, QEMU's default MAC address is hard-coded into both your receive initialization and lwIP. Fix your initialization to use the MAC address you read from the EEPROM, add a system call to pass the MAC address to lwIP, and modify lwIP to use the MAC address read from the card. Test your change by configuring QEMU to use a different MAC address.
  • Challenge! Modify your E1000 driver to be "zero copy." Currently, packet data has to be copied from user-space buffers to transmit packet buffers and from receive packet buffers back to user-space buffers. A zero copy driver avoids this by having user space and the E1000 share packet buffer memory directly. There are many different approaches to this, including mapping the kernel-allocated structures into user space or passing user-provided buffers directly to the E1000. Regardless of your approach, be careful how you reuse buffers so that you don't introduce races between user-space code and the E1000.
  • Challenge! Take the zero copy concept all the way into lwIP.
  • A typical packet is composed of many headers. The user sends data to be transmitted to lwIP in one buffer. The TCP layer wants to add a TCP header, the IP layer an IP header and the MAC layer an Ethernet header. Even though there are many parts to a packet, right now the parts need to be joined together so that the device driver can send the final packet.
  • The E1000’s transmit descriptor design is well-suited to collecting pieces of a packet scattered throughout memory, like the packet fragments created inside lwIP. If you enqueue multiple transmit descriptors, but only set the EOP command bit on the last one, then the E1000 will internally concatenate the packet buffers from these descriptors and only transmit the concatenated buffer when it reaches the EOP-marked descriptor. As a result, the individual packet pieces never need to be joined together in memory.
  • Change your driver to be able to send packets composed of many buffers without copying and modify lwIP to avoid merging the packet pieces as it does right now.
  • Challenge! Augment your system call interface to service more than one user environment. This will prove useful if there are multiple network stacks (and multiple network servers) each with their own IP address running in user mode. The receive system call will need to decide to which environment it needs to forward each incoming packet.
  • Note that the current interface cannot tell the difference between two packets and if multiple environments call the packet receive system call, each respective environment will get a subset of the incoming packets and that subset may include packets that are not destined to the calling environment.
  • Sections 2.2 and 3 in this Exokernel paper have an in-depth explanation of the problem and a method of addressing it in a kernel like JOS. Use the paper to help you get a grip on the problem, chances are you do not need a solution as complex as presented in the paper.

The Web Server

A web server in its simplest form sends the contents of a file to the requesting client. The lab provides skeleton code for a very simple web server in user/httpd.c. The skeleton code deals with incoming connections and parses the headers.

Exercise 13. The web server is missing the code that deals with sending the contents of a file back to the client. Finish the web server by implementing send_file and send_data.

Let's first look at the httpd source:

#include <inc/lib.h>
#include <lwip/sockets.h>
#include <lwip/inet.h>

#define PORT 80
#define VERSION "0.1"
#define HTTP_VERSION "1.0"

#define E_BAD_REQ	1000

#define BUFFSIZE 512
#define MAXPENDING 5	// Max connection requests

struct http_request {
	int sock;
	char *url;
	char *version;
};

struct responce_header {
	int code;
	char *header;
};

struct responce_header headers[] = {
	{ 200, 	"HTTP/" HTTP_VERSION " 200 OK\r\n"
		"Server: jhttpd/" VERSION "\r\n"},
	{0, 0},
};

struct error_messages {
	int code;
	char *msg;
};

struct error_messages errors[] = {
	{400, "Bad Request"},
	{404, "Not Found"},
};

static void
die(char *m)
{
	cprintf("%s\n", m);
	exit();
}

static void
req_free(struct http_request *req)
{
	free(req->url);
	free(req->version);
}

static int
send_header(struct http_request *req, int code)
{
	struct responce_header *h = headers;
	while (h->code != 0 && h->header != 0) {
		if (h->code == code)
			break;
		h++;
	}

	if (h->code == 0)
		return -1;

	int len = strlen(h->header);
	if (write(req->sock, h->header, len) != len) {
		die("Failed to send bytes to client");
	}

	return 0;
}

static int
send_data(struct http_request *req, int fd)
{
	// LAB 6: Your code here.
	panic("send_data not implemented");
}

static int
send_size(struct http_request *req, off_t size)
{
	char buf[64];
	int r;

	r = snprintf(buf, 64, "Content-Length: %ld\r\n", (long)size);
	if (r > 63)
		panic("buffer too small!");

	if (write(req->sock, buf, r) != r)
		return -1;

	return 0;
}

static const char*
mime_type(const char *file)
{
	//TODO: for now only a single mime type
	return "text/html";
}

static int
send_content_type(struct http_request *req)
{
	char buf[128];
	int r;
	const char *type;

	type = mime_type(req->url);
	if (!type)
		return -1;

	r = snprintf(buf, 128, "Content-Type: %s\r\n", type);
	if (r > 127)
		panic("buffer too small!");

	if (write(req->sock, buf, r) != r)
		return -1;

	return 0;
}

static int
send_header_fin(struct http_request *req)
{
	const char *fin = "\r\n";
	int fin_len = strlen(fin);

	if (write(req->sock, fin, fin_len) != fin_len)
		return -1;

	return 0;
}

// given a request, this function creates a struct http_request
static int
http_request_parse(struct http_request *req, char *request)
{
	const char *url;
	const char *version;
	int url_len, version_len;

	if (!req)
		return -1;

	if (strncmp(request, "GET ", 4) != 0)
		return -E_BAD_REQ;

	// skip GET
	request += 4;

	// get the url
	url = request;
	while (*request && *request != ' ')
		request++;
	url_len = request - url;

	req->url = malloc(url_len + 1);
	memmove(req->url, url, url_len);
	req->url[url_len] = '\0';

	// skip space
	request++;

	version = request;
	while (*request && *request != '\n')
		request++;
	version_len = request - version;

	req->version = malloc(version_len + 1);
	memmove(req->version, version, version_len);
	req->version[version_len] = '\0';

	// no entity parsing

	return 0;
}

static int
send_error(struct http_request *req, int code)
{
	char buf[512];
	int r;

	struct error_messages *e = errors;
	while (e->code != 0 && e->msg != 0) {
		if (e->code == code)
			break;
		e++;
	}

	if (e->code == 0)
		return -1;

	r = snprintf(buf, 512, "HTTP/" HTTP_VERSION" %d %s\r\n"
			       "Server: jhttpd/" VERSION "\r\n"
			       "Connection: close\r\n"
			       "Content-type: text/html\r\n"
			       "\r\n"
			       "<html><body><p>%d - %s</p></body></html>\r\n",
			       e->code, e->msg, e->code, e->msg);

	if (write(req->sock, buf, r) != r)
		return -1;

	return 0;
}

static int
send_file(struct http_request *req)
{
	int r;
	off_t file_size = -1;
	int fd;

	// open the requested url for reading
	// if the file does not exist, send a 404 error using send_error
	// if the file is a directory, send a 404 error using send_error
	// set file_size to the size of the file

	// LAB 6: Your code here.
	panic("send_file not implemented");

	if ((r = send_header(req, 200)) < 0)
		goto end;

	if ((r = send_size(req, file_size)) < 0)
		goto end;

	if ((r = send_content_type(req)) < 0)
		goto end;

	if ((r = send_header_fin(req)) < 0)
		goto end;

	r = send_data(req, fd);

end:
	close(fd);
	return r;
}

static void
handle_client(int sock)
{
	struct http_request con_d;
	int r;
	char buffer[BUFFSIZE];
	int received = -1;
	struct http_request *req = &con_d;

	while (1)
	{
		// Receive message
		if ((received = read(sock, buffer, BUFFSIZE)) < 0)
			panic("failed to read");

		memset(req, 0, sizeof(*req));

		req->sock = sock;

		r = http_request_parse(req, buffer); // fills in req's url and version
		if (r == -E_BAD_REQ)
			send_error(req, 400);
		else if (r < 0)
			panic("parse failed");
		else
			send_file(req);

		req_free(req);

		// no keep alive
		break;
	}

	close(sock);
}

void
umain(int argc, char **argv)
{
	int serversock, clientsock;
	struct sockaddr_in server, client;

	binaryname = "jhttpd";

	// Create the TCP socket
	if ((serversock = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0)
		die("Failed to create socket");

	// Construct the server sockaddr_in structure
	memset(&server, 0, sizeof(server));		// Clear struct
	server.sin_family = AF_INET;			// Internet/IP
	server.sin_addr.s_addr = htonl(INADDR_ANY);	// IP address
	server.sin_port = htons(PORT);			// server port

	// Bind the server socket
	if (bind(serversock, (struct sockaddr *) &server,
		 sizeof(server)) < 0)
	{
		die("Failed to bind the server socket");
	}

	// Listen on the server socket
	if (listen(serversock, MAXPENDING) < 0)
		die("Failed to listen on server socket");

	cprintf("Waiting for http connections...\n");

	while (1) {
		unsigned int clientlen = sizeof(client);
		// Wait for client connection
		if ((clientsock = accept(serversock,
					 (struct sockaddr *) &client,
					 &clientlen)) < 0)
		{
			die("Failed to accept client connection");
		}
		handle_client(clientsock);
	}

	close(serversock);
}

Reading the source shows what httpd does: it parses the incoming message, splits the data buffer into fields according to the HTTP format, and handles the socket operations (bind, listen, accept).

For send_file:

static int
send_file(struct http_request *req)
{
	int r;
	off_t file_size = -1;
	int fd;

	// open the requested url for reading
	// if the file does not exist, send a 404 error using send_error
	// if the file is a directory, send a 404 error using send_error
	// set file_size to the size of the file

	// LAB 6: Your code here.
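	if((fd = open(req->url, O_RDONLY)) < 0){
		// Nothing was opened, so return directly instead of jumping to the close(fd) at end.
		return send_error(req, 404);
	}

	struct Stat stat;
	if((r = fstat(fd, &stat)) < 0){
		goto end;
	}
	if(stat.st_isdir){
		r = send_error(req, 404);
		goto end;
	}
	file_size = stat.st_size;	// reported to the client as Content-Length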

	if ((r = send_header(req, 200)) < 0)
		goto end;

	if ((r = send_size(req, file_size)) < 0)
		goto end;

	if ((r = send_content_type(req)) < 0)
		goto end;

	if ((r = send_header_fin(req)) < 0)
		goto end;

	r = send_data(req, fd);

end:
	close(fd);
	return r;
}

Then send_data:


static int
send_data(struct http_request *req, int fd)
{
	// LAB 6: Your code here.
	// Send the file in BUFFSIZE chunks: nsipc_send asserts size < 1600, so a
	// single write of a larger file would trip that assertion.
	char buf[BUFFSIZE];
	int n;

	while((n = read(fd, buf, sizeof(buf))) > 0){
		if(write(req->sock, buf, n) != n){
			die("Failed to send bytes to client");
		}
	}
	// n is 0 at end of file and negative if the read failed
	return n;
}
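
The same read-then-write pattern can be tried outside JOS. The sketch below uses plain POSIX `read` and `write` on ordinary file descriptors; `copy_fd` and `CHUNK` are illustrative names, and unlike the JOS code it also retries short writes, which POSIX permits.

```c
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 512   /* illustrative chunk size */

/* Copy everything readable from in_fd to out_fd in CHUNK-sized pieces.
 * Returns the number of bytes copied, or -1 on a read or write error. */
static ssize_t copy_fd(int in_fd, int out_fd)
{
    char buf[CHUNK];
    ssize_t total = 0;
    ssize_t n;

    while ((n = read(in_fd, buf, sizeof(buf))) > 0) {
        ssize_t off = 0;

        /* write may accept fewer bytes than requested, so loop until done. */
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));

            if (w <= 0)
                return -1;
            off += w;
        }
        total += n;
    }
    return n < 0 ? -1 : total;
}
```

Memory use stays bounded by the chunk size no matter how large the file is, which is the main reason to prefer this over reading the whole file into one allocation.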

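Finally, run make run-httpd-nox and open http://localhost:26002 in a browser inside the VM; the browser shows a 404. Then open http://localhost:26002/index.html and the web server returns the "cheesy web page" content. (The port is the one that make which-ports reports for the web server.)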


Once you’ve finished the web server, start the webserver (make run-httpd-nox) and point your favorite browser at http://host:port/index.html, where host is the name of the computer running QEMU (if you’re running QEMU on athena, use hostname.mit.edu, where hostname is the output of the hostname command on athena; use localhost if you’re running the web browser and QEMU on the same computer) and port is the port number reported for the web server by make which-ports. You should see a web page served by the HTTP server running inside JOS.


At this point, you should score 105/105 on make grade.

Challenge! Add a simple chat server to JOS, where multiple people can connect to the server and anything that any user types is transmitted to the other users. To do this, you will have to find a way to communicate with multiple sockets at once and to send and receive on the same socket at the same time. There are multiple ways to go about this. lwIP provides a MSG_DONTWAIT flag for recv (see lwip_recvfrom in net/lwip/api/sockets.c), so you could constantly loop through all open sockets, polling them for data. Note that, while recv flags are supported by the network server IPC, they aren’t accessible via the regular read function, so you’ll need a way to pass the flags. A more efficient approach is to start one or more environments for each connection and to use IPC to coordinate them. Conveniently, the lwIP socket ID found in the struct Fd for a socket is global (not per-environment), so, for example, the child of a fork inherits its parents sockets. Or, an environment can even send on another environment’s socket simply by constructing an Fd containing the right socket ID.

One thing to note here: httpd never calls the receive and transmit system calls implemented earlier directly. It reaches them through the socket's read and write:

write(req->sock, buf, r) -> (*dev->dev_write)(fd, buf, n);
// lib/sockets.c
struct Dev devsock =
{
	.dev_id =	's',
	.dev_name =	"sock",
	.dev_read =	devsock_read,
	.dev_write =	devsock_write,
	.dev_close =	devsock_close,
	.dev_stat =	devsock_stat,
};

static ssize_t
devsock_write(struct Fd *fd, const void *buf, size_t n)
{
	return nsipc_send(fd->fd_sock.sockid, buf, n, 0);
}

//lib/nsipc.c
int
nsipc_send(int s, const void *buf, int size, unsigned int flags)
{
	nsipcbuf.send.req_s = s;
	assert(size < 1600);
	memmove(&nsipcbuf.send.req_buf, buf, size);
	nsipcbuf.send.req_size = size;
	nsipcbuf.send.req_flags = flags;
	return nsipc(NSREQ_SEND);
}

static int
nsipc(unsigned type)
{
	static envid_t nsenv;
	if (nsenv == 0)
		nsenv = ipc_find_env(ENV_TYPE_NS);

	static_assert(sizeof(nsipcbuf) == PGSIZE);

	if (debug)
		cprintf("[%08x] nsipc %d\n", thisenv->env_id, type);

	ipc_send(nsenv, type, &nsipcbuf, PTE_P|PTE_W|PTE_U);
	return ipc_recv(NULL, NULL, NULL);
}

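As shown, a write to a socket from user space ends up as an ipc_send of type NSREQ_SEND, which enters the core network server (in the architecture figure, the green httpd reaches the network server through sockets over IPC) and eventually reaches the e1000 driver implemented earlier.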

Network server architecture

The overall flow is as follows:

[Figure: packet transmit flow]

And with that, the lab is done. 🌺

Reference

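A recommended follow-up course: MIT 6.S081 Operating System Engineering

The data link layer is layer 2 of the OSI reference model, between the physical layer and the network layer. On broadcast multi-access links (LANs), where stations may contend for the medium, it is further divided into the media access control (MAC) sublayer and the logical link control (LLC) sublayer. The MAC sublayer is dedicated to handling contention and collisions in medium access.

PHY: Port Physical Layer, the physical layer of the OSI model. A PHY connects a data link layer device (the MAC) to a physical medium such as optical fiber or copper cable. A typical PHY includes a PCS (Physical Coding Sublayer) and a PMD (Physical Medium Dependent sublayer). The PCS encodes and decodes the information being transmitted and received so that the receiver can recover the signal more easily.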