Linux on chengzhycn's blog

Linux packet receive and transmit paths

Thu, 13 Oct 2022 14:30:28 +0800

Flow diagram

From "Understanding Linux Network Internals"

Receive path

TL;DR

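The softirq handler for the receive softirq, NET_RX_SOFTIRQ, is net_rx_action.
net_rx_action calls the poll callback registered by the NIC driver.
The poll callback takes frames off the NIC ring buffer and builds an skb for each:
- run the BPF program attached in xdpdrv (driver) mode and get its action result
- if the result is XDP_PASS, build the skb and initialize some of its metadata fields
The kernel's GRO and RPS processing is then invoked.
The skb enters __netif_receive_skb_core, which processes it:
- run the BPF program attached in xdpgeneric mode and get its action result
- if the result is XDP_PASS, walk ptype_all and dev->ptype_all to deliver the skb to packet taps (packet capture)
- run tc ingress processing via sch_handle_ingress
- look up ptype_base and dev->ptype_specific and hand the skb to the matching layer-3 protocol handler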

NAPI

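**The idea behind NAPI is to replace a purely interrupt-driven receive model with a mix of interrupts and polling.** If a new frame arrives while the kernel is still processing earlier ones, the NIC has no need to raise another interrupt. The kernel keeps processing the data in the device's input queue with that device's interrupt disabled, and re-enables the interrupt once the queue is empty.

From the kernel's point of view, NAPI has the following advantages:

Lower CPU load (fewer interrupts)
Fairer handling across devices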

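The NAPI flow is described below using the ixgbe NIC driver as an example.

Registration

When the ixgbe driver initializes its interrupt vectors, it calls netif_napi_add to set up NAPI, ==registering the ixgbe_poll function in the napi structure and adding the napi to the device's napi_list==: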

/**
 * ixgbe_alloc_q_vector - Allocate memory for a single interrupt vector
 * @adapter: board private structure to initialize
 * @v_count: q_vectors allocated on adapter, used for ring interleaving
 * @v_idx: index of vector in adapter struct
 * @txr_count: total number of Tx rings to allocate
 * @txr_idx: index of first Tx ring to allocate
 * @xdp_count: total number of XDP rings to allocate
 * @xdp_idx: index of first XDP ring to allocate
 * @rxr_count: total number of Rx rings to allocate
 * @rxr_idx: index of first Rx ring to allocate
 *
 * We allocate one q_vector.  If allocation fails we return -ENOMEM.
 **/
static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
                int v_count, int v_idx,
                int txr_count, int txr_idx,
                int xdp_count, int xdp_idx,
                int rxr_count, int rxr_idx)
{
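    /* ... */

    /* initialize NAPI: ixgbe_poll is the poll callback, 64 is the NAPI
     * weight (the most packets one poll call may process) */
    netif_napi_add(adapter->netdev, &q_vector->napi,
               ixgbe_poll, 64);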
}

Interrupt handler

When the ixgbe driver receives an interrupt, it calls ixgbe_msix_clean_rings

Linux process scheduling

Tue, 11 Jan 2022 14:30:28 +0800

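Broadly speaking, scheduling has to answer two questions:

When to schedule
Which process to schedule

A good scheduling system can be judged by the following criteria: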
fast process response time
good throughput for background jobs
avoidance of process starvation
reconciliation of the needs of low- and high- priority processes

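Linux scheduling is based on time sharing: several processes share one CPU, and CPU time is divided into time slices. Time sharing relies on the timer interrupt and is transparent to processes.

Linux processes are preemptible. When a process enters the TASK_RUNNING state, the kernel checks whether its dynamic priority is higher than that of the currently running process. If it is, the running process is interrupted and the scheduler is invoked to select another process to run (usually the one that just became runnable). A process can also be preempted when its time slice expires: the TIF_NEED_RESCHED flag in the current process's struct thread_info is set, and the scheduler is invoked once the timer interrupt handler finishes.

A preempted process is not suspended: it stays in the TASK_RUNNING state, it just no longer uses the CPU.

Scheduling algorithm

Before Linux 2.6 the scheduling algorithm was simple: on every process switch, the kernel scanned the list of runnable processes, computed their priorities, and picked the "best" one to run. The drawback is that the cost of a switch grows linearly with the number of runnable processes.

Scheduling policies: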

SCHED_FIFO
SCHED_RR
SCHED_NORMAL

O(1) Scheduler

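Based on Linux 2.6.12.

Priority

In the Linux 2.6.12 kernel, priority is expressed differently in user space and in the kernel. The user-space priority is the familiar nice value, in the range [-20, 19]. The kernel-side priority is called the static priority and takes values in [0, 139]: [0, 99] is for real-time processes, and [100, 139] corresponds to the user-space nice values.

On top of the static priority, what actually takes effect is the dynamic priority, which is the static priority adjusted by a runtime bonus. A larger bonus yields a numerically lower, and therefore better, priority: $$ \text{dynamic\_priority} = \max(100, \min(\text{static\_priority} - \text{bonus} + 5, 139)), \quad \text{bonus} \in [0, 10] $$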

Linux processes

Tue, 30 Nov 2021 08:30:28 +0800

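Processes and lightweight processes

In the Linux kernel, a process or thread is represented by the task_struct data structure, defined in include/linux/sched.h.

Threads on Linux are implemented by the Native POSIX Thread Library (NPTL). To the kernel, a Linux thread is really just another process (a task_struct); the difference is that a thread's "process" shares its address space, file descriptors, and so on with the other threads, which is why it is called a ==lightweight process==. Linux threads are therefore independent scheduling units and can run simultaneously on different CPUs. By contrast, when threads are implemented purely in user space, as with early Linux thread libraries, the kernel sees only one process even for a multithreaded program, so all of its threads are confined to a single CPU, which is crippling on multi-core machines.

In terms of implementation, Linux threads (LWPs) are created and used through the pthread library. Creating a process and creating a thread both end up in the clone() system call (kernel/fork.c); the difference is the flags each one passes.

Process management

Process states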

TASK_RUNNING: The process is either executing on a CPU or waiting to be executed
TASK_INTERRUPTIBLE: The process is suspended (sleeping) until some condition becomes true.
TASK_UNINTERRUPTIBLE: Like TASK_INTERRUPTIBLE, except that delivering a signal to the sleeping process leaves its state unchanged.

PID

The PID distinguishes one process descriptor from another. The upper limit on PIDs in a Linux system can be read from /proc/sys/kernel/pid_max.

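Every process or lightweight process is assigned a unique PID. Yet all threads of the same process report the same process ID. How is that achieved?

To comply with the POSIX standard, Linux uses the notion of a thread group. Every thread stores the PID of the first thread of its thread group in its tgid field. ==What the getpid() system call actually returns is the value of tgid.==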

/**
 * sys_getpid - return the thread group id of the current process
 *
 * Note, despite the name, this returns the tgid not the pid.  The tgid and
 * the pid are identical unless CLONE_THREAD was specified on clone() in
 * which case the tgid is the same in all threads of the same group.
 *
 * This is SMP safe as current->tgid does not change.
 */
SYSCALL_DEFINE0(getpid)
{
    return task_tgid_vnr(current);
}

Managing PIDs with a bitmap

Managing PIDs with IDR

PID: replace pid bitmap implementation with IDR API commit, commit

Process switching

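==TL;DR: when the current process A calls schedule(), the kernel saves A's register state (its hardware context) in A's task_struct->thread, with part of it kept on A's kernel stack, and loads the saved register state of the next process B from B's task_struct->thread.==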