BUG: Invalid wait context

发表于 2025-05-14

BUG：非法的等待上下文

1. 问题起因

最近在升级产品内核。为了让问题尽早暴露，并方便定位，开启了很多DEBUG选项。

CONFIG_DEBUG_LOCK_ALLOC=y    //其中一个配置项

在测试中遇到了异常日志打印：

May 13 15:16:32 kern.warn kernel: [ 1880.242388] =============================
May 13 15:16:32 kern.warn kernel: [ 1880.242429] [ BUG: Invalid wait context ]
May 13 15:16:32 kern.warn kernel: [ 1880.242469] 6.1 #3 Tainted: G           O      
May 13 15:16:32 kern.warn kernel: [ 1880.242517] -----------------------------
May 13 15:16:32 kern.warn kernel: [ 1880.242557] tc/6787 is trying to lock:
May 13 15:16:32 kern.warn kernel: [ 1880.242595] ffff888100e66298 (&dev->mutex){....}-{3:3}, at: netif_freeze_queues+0x4b/0x90
May 13 15:16:32 kern.warn kernel: [ 1880.242677] other info that might help us debug this:
May 13 15:16:32 kern.warn kernel: [ 1880.242724] context-{4:4}
May 13 15:16:32 kern.warn kernel: [ 1880.242754] 2 locks held by tc/6787:
May 13 15:16:32 kern.warn kernel: [ 1880.242792]  #0: ffffffff81a42e08 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x153/0x5b0
May 13 15:16:32 kern.warn kernel: [ 1880.242874]  #1: ffff888100ef1438 (&dev->tx_global_lock){+.-.}-{2:2}, at: dev_deactivate_many+0xbb/0x410
May 13 15:16:32 kern.warn kernel: [ 1880.242962] stack backtrace:
May 13 15:16:32 kern.warn kernel: [ 1880.242996] CPU: 0 PID: 6787 Comm: tc Kdump: loaded Tainted: G           O       6.1 #3
May 13 15:16:32 kern.warn kernel: [ 1880.243069] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
May 13 15:16:32 kern.warn kernel: [ 1880.243159] Call Trace:
May 13 15:16:32 kern.warn kernel: [ 1880.243188]  <TASK>
May 13 15:16:32 kern.warn kernel: [ 1880.243216]  dump_stack_lvl+0x76/0xaa
May 13 15:16:32 kern.warn kernel: [ 1880.243257]  dump_stack+0x10/0x16
May 13 15:16:32 kern.warn kernel: [ 1880.243293]  __lock_acquire.cold+0x420/0x474
May 13 15:16:32 kern.warn kernel: [ 1880.243339]  lock_acquire+0xc2/0x2b0
May 13 15:16:32 kern.warn kernel: [ 1880.243379]  ? netif_freeze_queues+0x4b/0x90
May 13 15:16:32 kern.warn kernel: [ 1880.243422]  ? lock_acquire+0xc2/0x2b0
May 13 15:16:32 kern.warn kernel: [ 1880.243462]  ? dev_deactivate_many+0xbb/0x410
May 13 15:16:32 kern.warn kernel: [ 1880.243507]  _raw_spin_lock+0x2f/0x50
May 13 15:16:32 kern.warn kernel: [ 1880.243547]  ? netif_freeze_queues+0x4b/0x90
May 13 15:16:32 kern.warn kernel: [ 1880.243591]  netif_freeze_queues+0x4b/0x90
May 13 15:16:32 kern.warn kernel: [ 1880.243633]  ? dev_deactivate_many+0x1f/0x410
May 13 15:16:32 kern.warn kernel: [ 1880.243677]  dev_deactivate_many+0xc3/0x410
May 13 15:16:32 kern.warn kernel: [ 1880.243721]  dev_deactivate+0x3c/0x70
May 13 15:16:32 kern.warn kernel: [ 1880.243760]  qdisc_graft+0x5d9/0x800
May 13 15:16:32 kern.warn kernel: [ 1880.243800]  tc_modify_qdisc+0x68f/0xa70
May 13 15:16:32 kern.warn kernel: [ 1880.243843]  rtnetlink_rcv_msg+0x17d/0x5b0
May 13 15:16:32 kern.warn kernel: [ 1880.243885]  ? lock_acquire+0xc2/0x2b0
May 13 15:16:32 kern.warn kernel: [ 1880.243924]  ? find_held_lock+0x31/0x90
May 13 15:16:32 kern.warn kernel: [ 1880.243964]  ? netlink_deliver_tap+0xfb/0x520
May 13 15:16:32 kern.warn kernel: [ 1880.244010]  ? rtnl_fdb_dump+0x2d0/0x2d0
May 13 15:16:32 kern.warn kernel: [ 1880.244050]  netlink_rcv_skb+0x57/0x110
May 13 15:16:32 kern.warn kernel: [ 1880.244092]  rtnetlink_rcv+0x15/0x20
May 13 15:16:32 kern.warn kernel: [ 1880.244129]  netlink_unicast+0x1ab/0x290
May 13 15:16:32 kern.warn kernel: [ 1880.244171]  netlink_sendmsg+0x22d/0x480
May 13 15:16:32 kern.warn kernel: [ 1880.244212]  ? netlink_unicast+0x290/0x290
May 13 15:16:32 kern.warn kernel: [ 1880.244265]  ____sys_sendmsg+0x230/0x260
May 13 15:16:32 kern.warn kernel: [ 1880.244307]  ___sys_sendmsg+0x96/0xd0
May 13 15:16:32 kern.warn kernel: [ 1880.244347]  __sys_sendmsg+0x7d/0xd0
May 13 15:16:32 kern.warn kernel: [ 1880.244385]  __x64_sys_sendmsg+0x1d/0x30
May 13 15:16:32 kern.warn kernel: [ 1880.244426]  do_syscall_64+0x37/0x90
May 13 15:16:32 kern.warn kernel: [ 1880.244465]  entry_SYSCALL_64_after_hwframe+0x64/0xce
May 13 15:16:32 kern.warn kernel: [ 1880.244515] RIP: 0033:0x7f0ba008dbb0
May 13 15:16:32 kern.warn kernel: [ 1880.244558] Code: 00 f7 d8 64 89 02 b8 ff ff ff ff eb bb 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 80 3d 79 46 0d 00 00 74 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 54
May 13 15:16:32 kern.warn kernel: [ 1880.244708] RSP: 002b:00007ffea3bc3548 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
May 13 15:16:32 kern.warn kernel: [ 1880.244778] RAX: ffffffffffffffda RBX: 000000006823c4c2 RCX: 00007f0ba008dbb0
May 13 15:16:32 kern.warn kernel: [ 1880.244842] RDX: 0000000000000000 RSI: 00007ffea3bc35a0 RDI: 0000000000000007
May 13 15:16:32 kern.warn kernel: [ 1880.244918] RBP: 000055cd2dc273a0 R08: 0000000000000010 R09: 0000000000000000
May 13 15:16:32 kern.warn kernel: [ 1880.244986] R10: 00007f0b9ffa1150 R11: 0000000000000202 R12: 00007ffea3bc35a0
May 13 15:16:32 kern.warn kernel: [ 1880.245048] R13: 000055cd2dc08b6e R14: 0000000000000000 R15: 0000000000000000
May 13 15:16:32 kern.warn kernel: [ 1880.245111]  </TASK>

2. 初步分析

单从日志上看，象是 netif_freeze_queues+0x4b/0x90 对 &dev->mutex 进行上锁。但是这是在原子上下文，不能使用可休眠的锁。于是触发了异常日志。

但是 netif_freeze_queues+0x4b 实际上并没有对 &dev->mutex 上mutex锁，而是对 &txq->_xmit_lock 上 spin_lock锁。

static void netif_freeze_queues(struct net_device *dev)
{
        unsigned int i;
        int cpu;

        cpu = smp_processor_id();
        for (i = 0; i < dev->num_tx_queues; i++) {
                struct netdev_queue *txq = netdev_get_tx_queue(dev, i);

                /* We are the only thread of execution doing a
                 * freeze, but we have to grab the _xmit_lock in
                 * order to synchronize with threads which are in
                 * the ->hard_start_xmit() handler and already
                 * checked the frozen bit.
                 */
                __netif_tx_lock(txq, cpu);    //here
                set_bit(__QUEUE_STATE_FROZEN, &txq->state);
                __netif_tx_unlock(txq);
        }
}

static inline void __netif_tx_lock(struct netdev_queue *txq, int cpu)
{
        spin_lock(&txq->_xmit_lock);          //here
        /* Pairs with READ_ONCE() in __dev_queue_xmit() */
        WRITE_ONCE(txq->xmit_lock_owner, cpu);
}

什么情况，内核怎么会将 spin_lock(&txq->_xmit_lock) 与 mutex_lock(&dev->mutex) 混淆了？？？

3. 抽丝剥茧

那我们就看看日志中的lockdep_map的详细信息吧。用crash工具实时调试内核。

crash> lockdep_map -x ffff888100e66298        //日志中提供的地址
struct lockdep_map {
  key = 0xffffffff81cfcc20 <__lockdep_no_validate__>,    //这个key不大对劲
  class_cache = {0xffffffff81f08900 <lock_classes+47232>, 0x0},
  name = 0xffffffff814f2f17 "&txq->_xmit_lock",    //确实是 对 &txq->_xmit_lock 进行上锁
  wait_type_outer = 0x0,
  wait_type_inner = 0x2,
  lock_type = 0x0
}

crash> raw_spinlock_t -x ffff888100e66280  //对应的spin_lock
struct raw_spinlock_t {
  raw_lock = {
    {
      val = {
        counter = 0x0
      },
      {
        locked = 0x0,
        pending = 0x0
      },
      {
        locked_pending = 0x0,
        tail = 0x0
      }
    }
  },
  magic = 0xdead4ead,
  owner_cpu = 0xffffffff,
  owner = 0xffffffffffffffff,
  dep_map = {
    key = 0xffffffff81cfcc20 <__lockdep_no_validate__>,
    class_cache = {0xffffffff81f08900 <lock_classes+47232>, 0x0},
    name = 0xffffffff814f2f17 "&txq->_xmit_lock",
    wait_type_outer = 0x0,
    wait_type_inner = 0x2,
    lock_type = 0x0
  }
}

从调试信息可以看出，确实是对 &txq->_xmit_lock 上锁，那么为什么会出现 &dev->mutex 呢？

从代码上看，在 __lock_acquire 中会从 class_cache中取得 class来做一些检查。

调用路径： spin_lock–>raw_spin_lock–>_raw_spin_lock–>__raw_spin_lock–>spin_acquire–>lock_acquire_exclusive–>lock_acquire–>__lock_acquire

static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
                          int trylock, int read, int check, int hardirqs_off,
                          struct lockdep_map *nest_lock, unsigned long ip,
                          int references, int pin_count)
{
        struct task_struct *curr = current;
        struct lock_class *class = NULL;
        struct held_lock *hlock;
        unsigned int depth;
        int chain_head = 0;
        int class_idx;
        u64 chain_key;

        if (unlikely(!debug_locks))
                return 0;

        if (!prove_locking || lock->key == &__lockdep_no_validate__)
                check = 0;

        if (subclass < NR_LOCKDEP_CACHING_CLASSES)
                class = lock->class_cache[subclass];      //here
        /*
         * Not cached?
         */
        if (unlikely(!class)) {
                class = register_lock_class(lock, subclass, 0);  //or here
                if (!class)
                        return 0;
        }
//省略无关
        hlock = curr->held_locks + depth;

        class_idx = class - lock_classes;

        hlock->class_idx = class_idx;
}

我们检查下 lockdep_map的 class

crash> lock_class -x 0xffffffff81f08900
struct lock_class {
  hash_entry = {
    next = 0x0,
    pprev = 0xffffffff81f290c0 <lock_classes+180288>
  },
  lock_entry = {
    next = 0xffffffff81f089d0 <lock_classes+47440>,
    prev = 0xffffffff81f08850 <lock_classes+47056>
  },
  locks_after = {
    next = 0xffffffff81f08920 <lock_classes+47264>,
    prev = 0xffffffff81f08920 <lock_classes+47264>
  },
  locks_before = {
    next = 0xffffffff81f08930 <lock_classes+47280>,
    prev = 0xffffffff81f08930 <lock_classes+47280>
  },
  key = 0xffffffff81cfcc20 <__lockdep_no_validate__>,    //key也是有点奇怪
  subclass = 0x0,
  dep_gen_id = 0x0,
  usage_mask = 0x100,
  usage_traces = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xffffffff82158a98 <stack_trace+72056>, 0x0},
  name_version = 0x1,
  name = 0xffffffff814e882a "&dev->mutex",    //真凶出现
  wait_type_inner = 0x3,
  wait_type_outer = 0x0,
  lock_type = 0x0
}

解析到这里，也就明白为什么出现 &dev->mutex 相关的错误日志的直接原因了。但是为什么 &txq->_xmit_lock 的class 是 &dev->mutex 呢？

4. 继续深入

在前面的分析中我们注意到2个疑点，一个是 &dev->mutex, 另一个是 __lockdep_no_validate__ 。

4.1 `&dev->mutex` 的初始化：

void device_initialize(struct device *dev)
{
//省略无关
        mutex_init(&dev->mutex);                  //初始化mutex
        lockdep_set_novalidate_class(&dev->mutex);  //设置novalidate
//省略无关
}

#define mutex_init(mutex)                                               \
do {                                                                    \
        static struct lock_class_key __key;                             \
                                                                        \
        __mutex_init((mutex), #mutex, &__key);                          \
} while (0)


#define lockdep_set_novalidate_class(lock) \
        lockdep_set_class_and_name(lock, &__lockdep_no_validate__, #lock)

第一个是初始化 mutex &dev->mutex , key是一个静态全局变量。调用路径： mutex_init–>__mutex_init–>debug_mutex_init->lockdep_init_map_wait–>lockdep_init_map_waits–>lockdep_init_map_type–>

第二个时设置novalidate，其实就是将mutex的 key替换成 lockdep_no_validate 全局变量，最终调用 lockdep_init_map_type，其他参数值不变。

4.2 `&dev->mutex` 的上锁

mutex_lock执行路径： mutex_lock–>mutex_lock_nested–>__mutex_lock–>__mutex_lock_common–>mutex_trylock–>mutex_acquire–>lock_acquire_exclusive–>lock_acquire–>__lock_acquire–>[first time]register_lock_class

在 lock_acquire_exclusive 之后的路径和 spin_lock相同。初次使用时会对class进行注册。

//伪代码
static struct lock_class *
register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
{
  //1. 在classhash_table中查找 key对应的calss， 条件是 class->key == key
  //2. 如果找到，直接返回对应的class
  //3. 如果不存在，则新建一个class，并初始化
  //4. 添加到classhash_table中
  //5. lock->class_cache[0] = class; 返回class
}

所以在mutex_lock(&dev->mutex) 初次执行时，触发了 register_lock_class，创建了一个 key==__lockdep_no_validate__的class，并添加到哈希表中。后续直接就从 lock->class_cache[0] 中取class。

5. 黎明曙光

熟悉了 &dev->mutex 的相关操作后，回过头来看 &txq->_xmit_lock。

5.1 `&txq->_xmit_lock`的初始化

static int netif_alloc_netdev_queues(struct net_device *dev)
{
//省略无关  
        netdev_for_each_tx_queue(dev, netdev_init_one_queue, NULL);
//省略无关
        return 0;
}

static void netdev_init_one_queue(struct net_device *dev,
                                  struct netdev_queue *queue, void *_unused)
{
        /* Initialize queue lock */
        spin_lock_init(&queue->_xmit_lock);
//省略无关
}

# define spin_lock_init(lock)                                   \
do {                                                            \
        static struct lock_class_key __key;                     \
                                                                \
        __raw_spin_lock_init(spinlock_check(lock),              \
                             #lock, &__key, LD_WAIT_CONFIG);    \
} while (0)

调用路径： spin_lock_init–>__raw_spin_lock_init–>lockdep_init_map_wait–>lockdep_init_map_waits–>lockdep_init_map_type

可以看到初始化的 key是个静态全局变量，但我们实际观察到的是 lockdep_no_validate 。

继续搜索，发现在代码中对特定设备的 &txq->_xmit_lock 进行了 novalidate处理。

static void __init imq_dev_set_lockdep_one(struct net_device *dev,
                            struct netdev_queue *txq, void *arg)
{
        lockdep_set_novalidate_class(&txq->_xmit_lock);
}

这个会将lockdep_map的key 替换成 lockdep_no_validate 。

看到这里，想必你已经基本明白问题的最终原因了。不过不急，我们把最后一步做完。

5.2 `&txq->_xmit_lock`的上锁

在第3节我们分析了部分 spin_lock调用路径的 __lock_acquire 代码。当时只是为了查看 class。而这次我们看看class 从哪里来的。

调用路径： spin_lock–>raw_spin_lock–>_raw_spin_lock–>__raw_spin_lock–>spin_acquire–>lock_acquire_exclusive–>lock_acquire–>__lock_acquire–>register_lock_class

当第一次调用spin_lock时，由于 lockdep_map.class_cache[subclass] == NULL, 进入 register_lock_class。使用 key 为 lockdep_no_validate 在哈希表中进行查找。注意因为mutex_lock(&dev->mutex) 已经执行过，所以哈希表中已经有 key 为 lockdep_no_validate 的表项。所以在伪代码的第2步，就获取到class了，且这个class的name是 &dev->mutex。随后 lock->class_cache[0] = class 。也就是我们通过crash工具查看到的结果。

//伪代码
static struct lock_class *
register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
{
  //1. 在classhash_table中查找 key对应的calss， 条件是 class->key == key
  //2. 如果找到，跳转第5步
  //3. 如果不存在，则新建一个class，并初始化
  //4. 添加到classhash_table中
  //5. lock->class_cache[0] = class; 返回class
}

6. 水落石出

分析到这里，结果已然跃于纸上。问题在于 &dev->mutex 和 &txq->_xmit_lock 都使用了 lockdep_set_novalidate_class 将其key值设为相同的 lockdep_no_validate 。而 &dev->mutex 的上锁操作较早，将 &dev->mutex 的class添加到哈希表中。后续 &txq->_xmit_lock 的上锁操作，就会根据 key为 lockdep_no_validate 查找到 &dev->mutex 的class, 导致随后的坚持出现问题，打印异常日志。

7. 后记

这个问题是锁的lockdep_map相关debug处理时出现的。如果不开启对应DEBUG选项，则不会有这个错误打印。

本文访问次数：... 次