cve: ./data/2021/39xxx/CVE-2021-39939.json An uncontrolled resource consumption vulnerability in GitLab Runner, affecting all versions starting from 13.7 before 14.3.6, all versions starting from 14.4 before 14.4.4, and all versions starting from 14.5 before 14.5.2, allows an attacker who triggers a job with a specially crafted docker image to exhaust resources on the runner manager. analysis: {"Container":"Container image used by GitLab Runner","CVE_Reason":"By triggering a job with a specially crafted docker image, an attacker can make the container runtime consume host resources such as CPU or memory without limit.","CVE_Consequence":"This CVE can exhaust host resources and disrupt other legitimate jobs; severity is high."}
cve: ./data/2021/43xxx/CVE-2021-43784.json runc is a CLI tool for spawning and running containers on Linux according to the OCI specification. In runc, netlink is used internally as a serialization system for specifying the relevant container configuration to the `C` portion of the code (responsible for the base namespace setup of containers). In all versions of runc prior to 1.0.3, the encoder did not handle the possibility of an integer overflow in the 16-bit length field for the byte array attribute type, meaning that a large enough malicious byte array attribute could result in the length overflowing and the attribute contents being parsed as netlink messages for container configuration. This vulnerability requires the attacker to have some control over the configuration of the container and would allow the attacker to bypass the namespace restrictions of the container by simply adding their own netlink payload which disables all namespaces. The main users impacted are those who allow untrusted images with untrusted configurations to run on their machines (such as with shared cloud infrastructure). runc version 1.0.3 contains a fix for this bug. As a workaround, one may try disallowing untrusted namespace paths from your container. It should be noted that untrusted namespace paths would allow the attacker to disable namespace protections entirely even in the absence of this bug. analysis: ```json {"Container":"runc","CVE_Reason":"A malicious byte array attribute can overflow the 16-bit length field, causing the attribute contents to be parsed as netlink messages for container configuration; an attacker can use this to disable all namespace restrictions, after which the container can consume more of the host's physical resources such as memory and CPU.","CVE_Consequence":"This CVE lets an attacker bypass the container's namespace restrictions, which can lead to abuse of host resources; severity is high."} ```
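The runc flaw above is a textbook 16-bit length truncation. The following is a minimal, self-contained C sketch of the failure mode and the kind of guard that prevents it; it is not runc's actual encoder, and names such as `nlattr_hdr` and `encode_bytes` are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* A netlink attribute header stores its total length in 16 bits. */
struct nlattr_hdr {
    uint16_t nla_len;  /* header + payload length */
    uint16_t nla_type;
};

/* Buggy shape: the sum silently wraps modulo 65536. */
static void encode_bytes(struct nlattr_hdr *hdr, size_t payload_len)
{
    hdr->nla_len = (uint16_t)(sizeof(*hdr) + payload_len);
}

/* Fixed shape: reject payloads the field cannot represent. */
static int encode_bytes_checked(struct nlattr_hdr *hdr, size_t payload_len)
{
    if (payload_len > UINT16_MAX - sizeof(*hdr))
        return -1;
    hdr->nla_len = (uint16_t)(sizeof(*hdr) + payload_len);
    return 0;
}

int main(void)
{
    struct nlattr_hdr hdr = {0};
    size_t evil = 65536 + 64;   /* attacker-sized byte array attribute */

    encode_bytes(&hdr, evil);
    /* Prints 68: the bytes beyond the recorded length are then parsed
     * as fresh netlink messages, i.e. attacker-controlled config. */
    printf("stored length: %u (real: %zu)\n", hdr.nla_len, sizeof(hdr) + evil);
    if (encode_bytes_checked(&hdr, evil) < 0)
        printf("checked encoder rejects oversized attribute\n");
    return 0;
}
```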
cve: ./data/2021/47xxx/CVE-2021-47011.json In the Linux kernel, the following vulnerability has been resolved: mm: memcontrol: slab: fix obtain a reference to a freeing memcg Patch series "Use obj_cgroup APIs to charge kmem pages", v5. Since Roman's series "The new cgroup slab memory controller" was applied, all slab objects are charged with the new obj_cgroup APIs. The new APIs introduce a struct obj_cgroup to charge slab objects, which prevents long-lived objects from pinning the original memory cgroup in memory. But there are still some corner-case objects (e.g. allocations larger than an order-1 page on SLUB) which are not charged with the new APIs. Those objects (including pages allocated directly from the buddy allocator) are charged as kmem pages, which still hold a reference to the memory cgroup. For example, the kernel stack is charged as kmem pages because the size of the kernel stack can be greater than 2 pages (e.g. 16KB on x86_64 or arm64). If we create a thread (suppose the thread stack is charged to memory cgroup A) and then move it from memory cgroup A to memory cgroup B, the kernel stack of the thread still holds a reference to memory cgroup A, so the thread can pin memory cgroup A in memory even if we remove cgroup A. This scenario can be reproduced with the following script, after which the system has added 500 dying cgroups (this is not a real-world issue, just a script to show that large kmallocs are charged as kmem pages which can pin the memory cgroup in memory): #!/bin/bash cat /proc/cgroups | grep memory cd /sys/fs/cgroup/memory echo 1 > memory.move_charge_at_immigrate for i in range{1..500} do mkdir kmem_test echo $$ > kmem_test/cgroup.procs sleep 3600 & echo $$ > cgroup.procs echo `cat kmem_test/cgroup.procs` > cgroup.procs rmdir kmem_test done cat /proc/cgroups | grep memory This patchset aims to make those kmem pages drop their reference to the memory cgroup by using the obj_cgroup APIs; with it, the number of dying cgroups no longer increases when running the test script above. This patch (of 7): rcu_read_lock/unlock can only guarantee that the memcg will not be freed; it cannot guarantee that css_get (which is in refill_stock when the cached memcg changes) succeeds against the memcg. rcu_read_lock() memcg = obj_cgroup_memcg(old) __memcg_kmem_uncharge(memcg) refill_stock(memcg) if (stock->cached != memcg) // css_get can change the ref counter from 0 back to 1. css_get(&memcg->css) rcu_read_unlock() This fix is very similar to commit eefbfa7fd678 ("mm: memcg/slab: fix use after free in obj_cgroup_charge"). Fix this by holding a reference to the memcg which is passed to __memcg_kmem_uncharge() before calling __memcg_kmem_uncharge(). analysis: {"Container":"[Any container on an affected Linux kernel version]","CVE_Reason":"[Memory allocations made inside a container can leave stale references that keep host memory pinned and prevent it from being reclaimed]","CVE_Consequence":"[This CVE can keep host memory from being released effectively; long-running systems accumulate a memory leak. Severity: high]"}
cve: ./data/2021/47xxx/CVE-2021-47119.json In the Linux kernel, the following vulnerability has been resolved: ext4: fix memory leak in ext4_fill_super Buffer head references must be released before calling kill_bdev(); otherwise the buffer head (and its page referenced by b_data) will not be freed by kill_bdev, and subsequently that bh will be leaked. If blocksizes differ, sb_set_blocksize() will kill current buffers and page cache by using kill_bdev(). And then the super block will be reread again, this time using the correct blocksize. sb_set_blocksize() did not fully free the superblock page and buffer head; being busy, they were not freed and instead leaked. This can easily be reproduced by calling an infinite loop of: systemctl start .mount, and systemctl stop .mount ... since systemd creates a cgroup for each slice which it mounts, and the bh leak gets amplified by a dying memory cgroup that also never gets freed, so the memory consumption is much more easily noticed. analysis: {"Container":"[Container images using the ext4 filesystem]","CVE_Reason":"[Running the container leaks host memory, steadily consuming more physical memory]","CVE_Consequence":"[This CVE causes a memory leak: as containers repeatedly mount and unmount ext4 filesystems, host memory consumption keeps growing and may eventually exhaust host memory. Severity: high]"}
cve: ./data/2021/47xxx/CVE-2021-47408.json In the Linux kernel, the following vulnerability has been resolved: netfilter: conntrack: serialize hash resizes and cleanups Syzbot was able to trigger the following warning [1]. No repro has been found by syzbot yet, but I was able to trigger a similar issue by having 2 scripts running in parallel, changing conntrack hash sizes, and: for j in `seq 1 1000` ; do unshare -n /bin/true >/dev/null ; done It would take more than 5 minutes for net_namespace structures to be cleaned up. This is because nf_ct_iterate_cleanup() has to restart every time a resize happens. By adding a mutex, we can serialize hash resizes and cleanups and also make get_next_corpse() faster by skipping over empty buckets. Even without resizes in the picture, this patch considerably speeds up network namespace dismantles. [1] INFO: task syz-executor.0:8312 can't die for more than 144 seconds. task:syz-executor.0 state:R running task stack:25672 pid: 8312 ppid: 6573 flags:0x00004006 Call Trace: context_switch kernel/sched/core.c:4955 [inline] __schedule+0x940/0x26f0 kernel/sched/core.c:6236 preempt_schedule_common+0x45/0xc0 kernel/sched/core.c:6408 preempt_schedule_thunk+0x16/0x18 arch/x86/entry/thunk_64.S:35 __local_bh_enable_ip+0x109/0x120 kernel/softirq.c:390 local_bh_enable include/linux/bottom_half.h:32 [inline] get_next_corpse net/netfilter/nf_conntrack_core.c:2252 [inline] nf_ct_iterate_cleanup+0x15a/0x450 net/netfilter/nf_conntrack_core.c:2275 nf_conntrack_cleanup_net_list+0x14c/0x4f0 net/netfilter/nf_conntrack_core.c:2469 ops_exit_list+0x10d/0x160 net/core/net_namespace.c:171 setup_net+0x639/0xa30 net/core/net_namespace.c:349 copy_net_ns+0x319/0x760 net/core/net_namespace.c:470 create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226 ksys_unshare+0x445/0x920 kernel/fork.c:3128 __do_sys_unshare kernel/fork.c:3202 [inline] __se_sys_unshare kernel/fork.c:3200 [inline] __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3200 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f63da68e739 RSP: 002b:00007f63d7c05188 EFLAGS: 00000246 ORIG_RAX: 0000000000000110 RAX: ffffffffffffffda RBX: 00007f63da792f80 RCX: 00007f63da68e739 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000000 RBP: 00007f63da6e8cc4 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f63da792f80 R13: 00007fff50b75d3f R14: 00007f63d7c05300 R15: 0000000000022000 Showing all locks held in the system: 1 lock held by khungtaskd/27: #0: ffffffff8b980020 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 kernel/locking/lockdep.c:6446 2 locks held by kworker/u4:2/153: #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: arch_atomic_long_set include/linux/atomic/atomic-long.h:41 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: atomic_long_set include/linux/atomic/atomic-instrumented.h:1198 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: set_work_data kernel/workqueue.c:634 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: set_work_pool_and_clear_pending kernel/workqueue.c:661 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x896/0x1690 kernel/workqueue.c:2268 #1: ffffc9000140fdb0 ((kfence_timer).work){+.+.}-{0:0}, at: process_one_work+0x8ca/0x1690 kernel/workqueue.c:2272 1 lock held by systemd-udevd/2970: 1 lock held by in:imklog/6258: #0: ffff88807f970ff0 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0xe9/0x100 fs/file.c:990 3 locks held by kworker/1:6/8158: 1 lock held by syz-executor.0/8312: 2 locks held by kworker/u4:13/9320: 1 lock held by ---truncated--- analysis: {"Container":"[Any container image on an affected Linux kernel version]","CVE_Reason":"[Connection-tracking activity inside a container can tie up large amounts of host CPU, because cleanup of net_namespace structures takes excessively long]","CVE_Consequence":"[This CVE can exhaust host CPU resources; severity is high]"}
cve: ./data/2021/47xxx/CVE-2021-47492.json In the Linux kernel, the following vulnerability has been resolved: mm, thp: bail out early in collapse_file for writeback page Currently collapse_file does not explicitly check PG_writeback; instead, page_has_private and try_to_release_page are used to filter writeback pages. This does not work for xfs with a blocksize equal to or larger than the pagesize, because in such a case xfs has no page->private. This patch makes collapse_file bail out early for a writeback page. Otherwise, xfs end_page_writeback will panic as follows. page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:ffff0003f88c86a8 index:0x0 pfn:0x84ef32 aops:xfs_address_space_operations [xfs] ino:30000b7 dentry name:"libtest.so" flags: 0x57fffe0000008027(locked|referenced|uptodate|active|writeback) raw: 57fffe0000008027 ffff80001b48bc28 ffff80001b48bc28 ffff0003f88c86a8 raw: 0000000000000000 0000000000000000 00000000ffffffff ffff0000c3e9a000 page dumped because: VM_BUG_ON_PAGE(((unsigned int) page_ref_count(page) + 127u <= 127u)) page->mem_cgroup:ffff0000c3e9a000 ------------[ cut here ]------------ kernel BUG at include/linux/mm.h:1212! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: BUG: Bad page state in process khugepaged pfn:84ef32 xfs(E) page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:0 index:0x0 pfn:0x84ef32 libcrc32c(E) rfkill(E) aes_ce_blk(E) crypto_simd(E) ... CPU: 25 PID: 0 Comm: swapper/25 Kdump: loaded Tainted: ... pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--) Call trace: end_page_writeback+0x1c0/0x214 iomap_finish_page_writeback+0x13c/0x204 iomap_finish_ioend+0xe8/0x19c iomap_writepage_end_bio+0x38/0x50 bio_endio+0x168/0x1ec blk_update_request+0x278/0x3f0 blk_mq_end_request+0x34/0x15c virtblk_request_done+0x38/0x74 [virtio_blk] blk_done_softirq+0xc4/0x110 __do_softirq+0x128/0x38c __irq_exit_rcu+0x118/0x150 irq_exit+0x1c/0x30 __handle_domain_irq+0x8c/0xf0 gic_handle_irq+0x84/0x108 el1_irq+0xcc/0x180 arch_cpu_idle+0x18/0x40 default_idle_call+0x4c/0x1a0 cpuidle_idle_call+0x168/0x1e0 do_idle+0xb4/0x104 cpu_startup_entry+0x30/0x9c secondary_start_kernel+0x104/0x180 Code: d4210000 b0006161 910c8021 94013f4d (d4210000) ---[ end trace 4a88c6a074082f8c ]--- Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt analysis: {"Container":"Containers using the Linux kernel (e.g. containers based on Ubuntu, CentOS, etc.)","CVE_Reason":"Under specific conditions, an application in the container can trigger this memory-management bug, tying up large amounts of host CPU","CVE_Consequence":"This CVE can over-consume host CPU and destabilize the system; in severe cases it crashes the kernel. Update the kernel to fix it"}
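The shape of that fix is easy to show in isolation. Below is a hedged, kernel-style sketch (not the upstream diff; `collapse_one_page` is a hypothetical helper) of the explicit PG_writeback bail-out that replaces the page_has_private()-based filtering xfs bypasses:

```c
/* Kernel-style sketch; assumes <linux/page-flags.h> and friends. */
static int collapse_one_page(struct page *page)
{
    if (!trylock_page(page))
        return -EAGAIN;            /* page busy, retry later */

    if (PageWriteback(page)) {     /* the explicit check the fix adds */
        unlock_page(page);
        return -EBUSY;             /* bail out early: IO in flight */
    }

    /* ... safe to isolate the page for THP collapse ... */
    unlock_page(page);
    return 0;
}
```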
cve: ./data/2021/4xxx/CVE-2021-4197.json An unprivileged write to the file handler flaw in the Linux kernel's control groups and namespaces subsystem was found in the way users can access certain less-privileged processes that are controlled by cgroups but have a higher-privileged parent process. This applies to both the cgroup2 and cgroup1 versions of control groups. A local user could use this flaw to crash the system or escalate their privileges on the system. analysis: {"Container":"[Any container using the Linux kernel's control groups and namespaces subsystem]","CVE_Reason":"[A low-privileged process in the container can influence a higher-privileged parent process via cgroups, causing abnormal consumption of host resources such as CPU or memory]","CVE_Consequence":"[This CVE can crash the system or allow privilege escalation; severity is high]"}
cve: ./data/2022/0xxx/CVE-2022-0532.json An incorrect sysctls validation vulnerability was found in CRI-O 1.18 and earlier. The sysctls from the list of "safe" sysctls specified for the cluster will be applied to the host if an attacker is able to create a pod with hostIPC and hostNetwork kernel namespaces. analysis: {"Container":"CRI-O 1.18 and earlier","CVE_Reason":"By creating a pod with hostIPC and hostNetwork kernel namespaces, an attacker can cause the sysctls from the cluster's 'safe' sysctls list to be applied to the host, potentially over-consuming host resources such as CPU, memory, or disk space.","CVE_Consequence":"This CVE can let an attacker abuse container permissions to affect the host system and exhaust its resources; severity is high."}
cve: ./data/2022/24xxx/CVE-2022-24122.json kernel/ucount.c in the Linux kernel 5.14 through 5.16.4, when unprivileged user namespaces are enabled, allows a use-after-free and privilege escalation because a ucounts object can outlive its namespace. analysis: {"Container":"[Any container on an affected kernel version]","CVE_Reason":"[A process in the container can exploit this use-after-free, tying up substantial CPU to sustain the attack or the privilege-escalation attempt]","CVE_Consequence":"[This CVE can lead to privilege escalation, letting an attacker run arbitrary code on the host. Severity: high]"}
cve: ./data/2022/27xxx/CVE-2022-27652.json A flaw was found in cri-o, where containers were incorrectly started with non-empty default permissions. A vulnerability was found in Moby (Docker Engine) where containers started incorrectly with non-empty inheritable Linux process capabilities. This flaw allows an attacker with access to programs with inheritable file capabilities to elevate those capabilities to the permitted set when execve(2) runs. analysis: {"Container":"cri-o, Moby (Docker Engine)","CVE_Reason":"Containers start with incorrect, non-empty default permissions, letting an attacker elevate inherited Linux process capabilities and then perform operations that consume large amounts of CPU or memory","CVE_Consequence":"An attacker can escalate privileges and go on to maliciously consume host resources; severity is high"}
cve: ./data/2022/28xxx/CVE-2022-28204.json A denial-of-service issue was discovered in MediaWiki 1.37.x before 1.37.2. Rendering of w/index.php?title=Special%3AWhatLinksHere&target=Property%3AP31&namespace=1&invert=1 can take more than thirty seconds. There is a DDoS risk. analysis: {"Container":"[Container images affected by this CVE]","CVE_Reason":"[Running the container can tie up large amounts of host CPU]","CVE_Consequence":"[This CVE drives up CPU usage while rendering the affected page; severity is medium, and it can be leveraged for DDoS attacks]"}
cve: ./data/2022/39xxx/CVE-2022-39206.json Onedev is an open source, self-hosted Git Server with CI/CD and Kanban. When using Docker-based job executors, the Docker socket (e.g. /var/run/docker.sock on Linux) is mounted into each Docker step. Users that can define and trigger CI/CD jobs on a project could use this to control the Docker daemon on the host machine. This is a known dangerous pattern, as it can be used to break out of Docker containers and, in most cases, gain root privileges on the host system. This issue allows regular (non-admin) users to potentially take over the build infrastructure of a OneDev instance. Attackers need to have an account (or be able to register one) and need permission to create a project. Since code.onedev.io has the right preconditions for this to be exploited by remote attackers, it could have been used to hijack builds of OneDev itself, e.g. by injecting malware into the docker images that are built and pushed to Docker Hub. The impact is increased by this as described before. Users are advised to upgrade to 7.3.0 or higher. There are no known workarounds for this issue. analysis: {"Container":"OneDev","CVE_Reason":"Running the container can hand control of the host's Docker daemon to an attacker, who can then consume massive resources, e.g. by creating large numbers of containers that occupy CPU, memory, or disk space","CVE_Consequence":"This CVE lets regular users take over the OneDev build infrastructure, potentially giving them full control of host resources; severity is high"}
cve: ./data/2022/48xxx/CVE-2022-48671.json In the Linux kernel, the following vulnerability has been resolved: cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all() syzbot is hitting a percpu_rwsem_assert_held(&cpu_hotplug_lock) warning at cpuset_attach() [1], because commit 4f7e7236435ca0ab ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock") missed that cpuset_attach() is also called from cgroup_attach_task_all(). Add cpus_read_lock() like what cgroup_procs_write_start() does.
analysis: {"Container":"[Container images using cgroups]","CVE_Reason":"[The missing cpus_read_lock() can derail CPU resource accounting during task attach, which in turn can tie up large amounts of host CPU]","CVE_Consequence":"[This CVE can cause deadlocks or lock races while attaching tasks to cgroups, destabilizing host CPU scheduling; severity is medium-high]"}
cve: ./data/2022/48xxx/CVE-2022-48799.json In the Linux kernel, the following vulnerability has been resolved: perf: Fix list corruption in perf_cgroup_switch() There's list corruption on cgrp_cpuctx_list. This happens on the following path: perf_cgroup_switch: list_for_each_entry(cgrp_cpuctx_list) cpu_ctx_sched_in ctx_sched_in ctx_pinned_sched_in merge_sched_in perf_cgroup_event_disable: remove the event from the list Use list_for_each_entry_safe() to allow removing an entry during iteration. analysis: {"Container":"[Any container image on an affected Linux kernel]","CVE_Reason":"[The list corruption in perf_cgroup_switch() can derail resource management, letting a container consume excessive CPU]","CVE_Consequence":"[This CVE could let an attacker exhaust CPU through malicious operations, destabilizing the host and other containers; severity is high]"}
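The perf fix above is the canonical use of `list_for_each_entry_safe()`. A minimal kernel-style sketch of the pattern follows (simplified types; `cgrp_ctx` and `walk_and_disable` are hypothetical names, not the perf code):

```c
#include <linux/list.h>

struct cgrp_ctx {
    struct list_head node;
    bool disabled;
};

static void walk_and_disable(struct list_head *head)
{
    struct cgrp_ctx *ctx, *tmp;

    /* Plain list_for_each_entry() would chase ctx->node.next after
     * the entry was deleted. The _safe variant caches the next entry
     * in @tmp up front, so removal inside the body is fine. */
    list_for_each_entry_safe(ctx, tmp, head, node) {
        if (ctx->disabled)
            list_del(&ctx->node);
    }
}
```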
cve: ./data/2022/48xxx/CVE-2022-48944.json In the Linux kernel, the following vulnerability has been resolved: sched: Fix yet more sched_fork() races Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") fixed a fork race vs cgroup, it opened up a race vs syscalls by not placing the task on the runqueue before it gets exposed through the pidhash. Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") tried to fix a single instance of this; instead, fix the whole class of issues, effectively reverting that commit. analysis: {"Container":"[Any container image on an affected Linux kernel]","CVE_Reason":"[fork operations in a container can trigger a scheduler race, allowing large numbers of processes to be created and occupy CPU]","CVE_Consequence":"[This CVE can drain host CPU; in severe cases it degrades host performance or causes denial of service. Severity: high]"}
cve: ./data/2022/48xxx/CVE-2022-48988.json In the Linux kernel, the following vulnerability has been resolved: memcg: fix possible use-after-free in memcg_write_event_control() memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too. Prior to 347c4a874710 ("memcg: remove cgroup_event->cft"), there was a call to __file_cft() which verified that the specified file is a regular cgroupfs file before further accesses. The cftype pointer returned from __file_cft() was no longer necessary and the commit inadvertently dropped the file type check with it, allowing any file to slip through. With the invariants broken, the d_name and parent accesses can now race against renames and removals of arbitrary files and cause use-after-free's. Fix the bug by resurrecting the file type check in __file_cft(). Now that cgroupfs is implemented through kernfs, checking the file operations needs to go through a layer of indirection. Instead, let's check the superblock and dentry type. analysis: {"Container":"[Any container image on an affected Linux kernel version]","CVE_Reason":"[Exploitation from inside a container can abnormally consume host memory, since the use-after-free can lead to memory leaks or unrestrained memory allocation]","CVE_Consequence":"[This CVE can exhaust host memory; severity is high, as it can be exploited to undermine the stability of the whole system]"}
cve: ./data/2022/49xxx/CVE-2022-49394.json In the Linux kernel, the following vulnerability has been resolved: blk-iolatency: Fix inflight count imbalances and IO hangs on offline iolatency needs to track the number of inflight IOs per cgroup. As this tracking can be expensive, it is disabled when no cgroup has iolatency configured for the device. To ensure that the inflight counters stay balanced, iolatency_set_limit() freezes the request_queue while manipulating the enabled counter, which ensures that no IO is in flight and thus all counters are zero. Unfortunately, iolatency_set_limit() isn't the only place where the enabled counter is manipulated. iolatency_pd_offline() can also dec the counter and trigger disabling. As this disabling happens without freezing the q, this can easily happen while some IOs are in flight and thus leak the counts. This can be easily demonstrated by turning on iolatency on one empty cgroup while IOs are in flight in other cgroups and then removing the cgroup. Note that iolatency shouldn't have been enabled elsewhere in the system to ensure that removing the cgroup disables iolatency for the whole device. The following keeps flipping on and off iolatency on sda: echo +io > /sys/fs/cgroup/cgroup.subtree_control while true; do mkdir -p /sys/fs/cgroup/test echo '8:0 target=100000' > /sys/fs/cgroup/test/io.latency sleep 1 rmdir /sys/fs/cgroup/test sleep 1 done and there's concurrent fio generating direct rand reads: fio --name test --filename=/dev/sda --direct=1 --rw=randread \ --runtime=600 --time_based --iodepth=256 --numjobs=4 --bs=4k while monitoring with the following drgn script: while True: for css in css_for_each_descendant_pre(prog['blkcg_root'].css.address_of_()): for pos in hlist_for_each(container_of(css, 'struct blkcg', 'css').blkg_list): blkg = container_of(pos, 'struct blkcg_gq', 'blkcg_node') pd = blkg.pd[prog['blkcg_policy_iolatency'].plid] if pd.value_() == 0: continue iolat = container_of(pd, 'struct iolatency_grp', 'pd') inflight = iolat.rq_wait.inflight.counter.value_() if inflight: print(f'inflight={inflight} {disk_name(blkg.q.disk).decode("utf-8")} ' f'{cgroup_path(css.cgroup).decode("utf-8")}') time.sleep(1) The monitoring output looks like the following: inflight=1 sda /user.slice inflight=1 sda /user.slice ... inflight=14 sda /user.slice inflight=13 sda /user.slice inflight=17 sda /user.slice inflight=15 sda /user.slice inflight=18 sda /user.slice inflight=17 sda /user.slice inflight=20 sda /user.slice inflight=19 sda /user.slice <- fio stopped, inflight stuck at 19 inflight=19 sda /user.slice inflight=19 sda /user.slice If a cgroup with stuck inflight ends up getting throttled, the throttled IOs will never get issued as there's no completion event to wake it up, leading to an indefinite hang. This patch fixes the bug by unifying enable handling into a work item which is automatically kicked off from iolatency_set_min_lat_nsec(), which is called from both the iolatency_set_limit() and iolatency_pd_offline() paths. Punting to a work item is necessary as iolatency_pd_offline() is called under spinlocks while freezing a request_queue requires a sleepable context. This also simplifies the code, reducing LOC sans the comments, and avoids the unnecessary freezes which were happening whenever a cgroup's latency target is newly set or cleared. analysis: {"Container":"[Any container image on an affected Linux kernel version]","CVE_Reason":"[The inflight count imbalance can leave many I/O operations suspended or blocked, which indirectly keeps CPU busy trying to service the stuck requests]","CVE_Consequence":"[This CVE can hang I/O on the host indefinitely, hurting container and host performance; severity is high]"}
cve: ./data/2022/49xxx/CVE-2022-49411.json In the Linux kernel, the following vulnerability has been resolved: bfq: Make sure bfqg for which we are queueing requests is online Bios queued into the BFQ IO scheduler can be associated with a cgroup that was already offlined. This may then cause insertion of this bfq_group into a service tree. But this bfq_group will be freed as soon as the last bio associated with it is completed, leading to use-after-free issues for service tree users. Fix the problem by making sure we always operate on an online bfq_group. If the bfq_group associated with the bio is not online, we pick the first online parent. analysis: {"Container":"[Containers using the BFQ IO scheduler]","CVE_Reason":"[Running a container can cause a BFQ scheduling group associated with an already-offlined cgroup to be inserted into a service tree, breaking resource management and indirectly causing abnormal host resource consumption]","CVE_Consequence":"[This CVE can misallocate resources; in severe cases it causes host IO contention or instability. Severity: high]"}
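The bfq remedy is a parent walk. Here is a self-contained sketch of that idea, under the assumption of a simplified group type (`bfq_group_sk` and `first_online_group` are illustrative, not the kernel's types):

```c
#include <stdbool.h>
#include <stddef.h>

struct bfq_group_sk {
    struct bfq_group_sk *parent; /* NULL only above the root */
    bool online;
};

/* Never queue against an offlined group: climb until we find an
 * online ancestor. The root group is always online, so the walk
 * terminates with a usable group. */
static struct bfq_group_sk *first_online_group(struct bfq_group_sk *g)
{
    while (g && !g->online)
        g = g->parent;
    return g;
}
```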
cve: ./data/2022/49xxx/CVE-2022-49647.json In the Linux kernel, the following vulnerability has been resolved: cgroup: Use separate src/dst nodes when preloading css_sets for migration Each cset (css_set) is pinned by its tasks. When we're moving tasks around across csets for a migration, we need to hold the source and destination csets to ensure that they don't go away while we're moving tasks about. This is done by linking cset->mg_preload_node on either the mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the same cset->mg_preload_node for both the src and dst lists was deemed okay as a cset can't be both the source and destination at the same time. Unfortunately, this overloading becomes problematic when multiple tasks are involved in a migration and some of them are identity noop migrations while others are actually moving across cgroups. For example, this can happen with the following sequence on cgroup1: #1> mkdir -p /sys/fs/cgroup/misc/a/b #2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs #3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS & #4> PID=$! #5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks #6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs Step #5 moves the group leader into a/b, and step #6 then moves the whole process, including the group leader, back into a. In this final migration, non-leader threads would be doing identity migration while the group leader is doing an actual one. After #3, let's say the whole process was in cset A, and that after #4, the leader moves to cset B. Then, during #6, the following happens: 1. cgroup_migrate_add_src() is called on B for the leader. 2. cgroup_migrate_add_src() is called on A for the other threads. 3. cgroup_migrate_prepare_dst() is called. It scans the src list. 4. It notices that B wants to migrate to A, so it tries to add A to the dst list but realizes that its ->mg_preload_node is already busy. 5. and then it notices A wants to migrate to A as it's an identity migration, it culls it by list_del_init()'ing its ->mg_preload_node and putting references accordingly. 6. The rest of migration takes place with B on the src list but nothing on the dst list. This means that A isn't held while migration is in progress. If all tasks leave A before the migration finishes and the incoming task pins it, the cset will be destroyed leading to use-after-free. This is caused by overloading cset->mg_preload_node for both src and dst preload lists. We wanted to exclude the cset from the src list but ended up inadvertently excluding it from the dst list too. This patch fixes the issue by separating out cset->mg_preload_node into ->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst preloadings don't interfere with each other. analysis: {"Container":"[Any container image using cgroups]","CVE_Reason":"[A flaw in the cgroup migration logic can corrupt resource management during task migration, tying up large amounts of host CPU handling the broken migration and guarding against the use-after-free]","CVE_Consequence":"[This CVE can throw host resource management into disarray; in severe cases the host becomes unstable or crashes. Severity: high]"}
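The structural fix is simply to stop overloading one list node for two lists. A sketch of before and after (field names follow the commit message; the surrounding struct is reduced to the relevant members):

```c
#include <linux/list.h>

/* Before: one node serves both preload lists, so culling an identity
 * migration from the src side also knocks the cset off the dst side. */
struct css_set_before {
    struct list_head mg_preload_node;      /* src OR dst, shared */
};

/* After: dedicated nodes, so src and dst preloading cannot interfere
 * and the destination cset stays pinned for the whole migration. */
struct css_set_after {
    struct list_head mg_src_preload_node;  /* pinned as a source */
    struct list_head mg_dst_preload_node;  /* pinned as a destination */
};
```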
cve: ./data/2023/28xxx/CVE-2023-28960.json An Incorrect Permission Assignment for Critical Resource vulnerability in Juniper Networks Junos OS Evolved allows a local, authenticated low-privileged attacker to copy potentially malicious files into an existing Docker container on the local system. A follow-on administrator could then inadvertently start the Docker container, leading to the malicious files being executed as root. This issue only affects systems with Docker configured and enabled, which is not enabled by default. Systems without Docker started are not vulnerable to this issue. This issue affects Juniper Networks Junos OS Evolved: 20.4 versions prior to 20.4R3-S5-EVO; 21.2 versions prior to 21.2R3-EVO; 21.3 versions prior to 21.3R3-EVO; 21.4 versions prior to 21.4R2-EVO. This issue does not affect Juniper Networks Junos OS Evolved versions prior to 19.2R1-EVO. analysis: {"Container":"Docker containers","CVE_Reason":"A low-privileged attacker can copy malicious files into an existing Docker container; when the container runs, it can consume large amounts of host resources such as CPU or memory","CVE_Consequence":"By getting the malicious files executed, the attacker gains root and can consume further host resources, degrading performance or causing denial of service; severity is high"}
cve: ./data/2023/30xxx/CVE-2023-30549.json Apptainer is an open source container platform for Linux. There is an ext4 use-after-free flaw that is exploitable through versions of Apptainer < 1.1.0 and installations that include apptainer-suid < 1.1.8 on older operating systems where that CVE has not been patched. That includes Red Hat Enterprise Linux 7, Debian 10 buster (unless the linux-5.10 package is installed), Ubuntu 18.04 bionic and Ubuntu 20.04 focal. Use-after-free flaws in the kernel can be used to attack the kernel for denial of service and potentially for privilege escalation. Apptainer 1.1.8 includes a patch that by default disables mounting of extfs filesystem types in setuid-root mode, while continuing to allow mounting of extfs filesystems in non-setuid "rootless" mode using fuse2fs. Some workarounds are possible. Either do not install apptainer-suid (for versions 1.1.0 through 1.1.7) or set `allow setuid = no` in apptainer.conf. This requires having unprivileged user namespaces enabled and except for apptainer 1.1.x versions will disallow mounting of sif files, extfs files, and squashfs files in addition to other, less significant impacts. (Encrypted sif files are also not supported unprivileged in apptainer 1.1.x.). Alternatively, use the `limit containers` options in apptainer.conf/singularity.conf to limit sif files to trusted users, groups, and/or paths, and set `allow container extfs = no` to disallow mounting of extfs overlay files. The latter option by itself does not disallow mounting of extfs overlay partitions inside SIF files, so that's why the former options are also needed. analysis: {"Container":"Apptainer","CVE_Reason":"A running container can exploit the ext4 use-after-free to attack the host kernel, potentially causing denial of service or privilege escalation and consuming more CPU.","CVE_Consequence":"This CVE can tie up large amounts of host resources; severity is high, with possible denial of service or privilege escalation."}
cve: ./data/2023/30xxx/CVE-2023-30999.json IBM Security Access Manager Container (IBM Security Verify Access Appliance 10.0.0.0 through 10.0.6.1 and IBM Security Verify Access Docker 10.0.0.0 through 10.0.6.1) could allow an attacker to cause a denial of service due to uncontrolled resource consumption. IBM X-Force ID: 254651.
analysis: {"Container":"IBM Security Access Manager Container","CVE_Reason":"Running the container can tie up large amounts of host CPU or memory due to uncontrolled resource consumption","CVE_Consequence":"This CVE enables a denial-of-service attack; severity is high"}
cve: ./data/2023/32xxx/CVE-2023-32327.json IBM Security Access Manager Container (IBM Security Verify Access Appliance 10.0.0.0 through 10.0.6.1 and IBM Security Verify Access Docker 10.0.0.0 through 10.0.6.1) is vulnerable to an XML External Entity Injection (XXE) attack when processing XML data. A remote attacker could exploit this vulnerability to expose sensitive information or consume memory resources. IBM X-Force ID: 254783. analysis: {"Container":"IBM Security Access Manager Container","CVE_Reason":"[While processing XML data, an XXE attack can cause the container to consume large amounts of host memory]","CVE_Consequence":"[This CVE lets a remote attacker use XXE to consume large amounts of memory, degrading performance or interrupting service; severity is high]"}
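IBM's code is not public, but the defensive pattern against XXE is well established. Below is a minimal libxml2 sketch of how a hardened consumer might parse untrusted XML (the `parse_untrusted` entry point is hypothetical): entity expansion stays off, and network fetches are blocked.

```c
#include <libxml/parser.h>

xmlDocPtr parse_untrusted(const char *buf, int len)
{
    /* Deliberately NOT passing XML_PARSE_NOENT: entities are left
     * unexpanded, which defuses classic XXE file disclosure and
     * "billion laughs"-style memory consumption. XML_PARSE_NONET
     * additionally blocks fetches of external resources. */
    return xmlReadMemory(buf, len, "untrusted.xml", NULL,
                         XML_PARSE_NONET | XML_PARSE_NOERROR);
}
```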
cve: ./data/2023/40xxx/CVE-2023-40453.json Docker Machine through 0.16.2 allows an attacker, who has control of a worker node, to provide crafted version data, which might potentially trick an administrator into performing an unsafe action (via escape sequence injection), or might have a data size that causes a denial of service to a bastion node. NOTE: This vulnerability only affects products that are no longer supported by the maintainer. analysis: {"Container":"Docker Machine","CVE_Reason":"[An attacker can supply crafted version data that exhausts a bastion node's resources, e.g. memory or CPU]","CVE_Consequence":"[This CVE can cause denial of service; severity is medium]"}
cve: ./data/2023/47xxx/CVE-2023-47633.json Traefik is an open source HTTP reverse proxy and load balancer. The traefik docker container uses 100% CPU when it serves as its own backend, which is an automatically generated route resulting from the Docker integration in the default configuration. This issue has been addressed in versions 2.10.6 and 3.0.0-beta5. Users are advised to upgrade. There are no known workarounds for this vulnerability. analysis: {"Container":"traefik","CVE_Reason":"Running the container ties up large amounts of host CPU","CVE_Consequence":"This CVE makes the Traefik container consume 100% CPU in its default configuration; severity is high"}
cve: ./data/2023/52xxx/CVE-2023-52707.json In the Linux kernel, the following vulnerability has been resolved: sched/psi: Fix use-after-free in ep_remove_wait_queue() If a non-root cgroup gets removed when there is a thread that registered a trigger and is polling on a pressure file within the cgroup, the polling waitqueue gets freed in the following path: do_rmdir cgroup_rmdir kernfs_drain_open_files cgroup_file_release cgroup_pressure_release psi_trigger_destroy However, the polling thread still has a reference to the pressure file and will access the freed waitqueue when the file is closed or upon exit: fput ep_eventpoll_release ep_free ep_remove_wait_queue remove_wait_queue This results in use-after-free as pasted below. The fundamental problem here is that cgroup_file_release() (and consequently the waitqueue's lifetime) is not tied to the file's real lifetime. Using wake_up_pollfree() here might be less than ideal, but it is in line with the comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()") since the waitqueue's lifetime is not tied to the file's and can be considered another special case. While this would be fixable by somehow making cgroup_file_release() be tied to the fput(), it would require sizable refactoring at cgroups or a higher layer, which might be more justifiable if we identify more cases like this. BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0 Write of size 4 at addr ffff88810e625328 by task a.out/4404 CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38 Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017 Call Trace: dump_stack_lvl+0x73/0xa0 print_report+0x16c/0x4e0 kasan_report+0xc3/0xf0 kasan_check_range+0x2d2/0x310 _raw_spin_lock_irqsave+0x60/0xc0 remove_wait_queue+0x1a/0xa0 ep_free+0x12c/0x170 ep_eventpoll_release+0x26/0x30 __fput+0x202/0x400 task_work_run+0x11d/0x170 do_exit+0x495/0x1130 do_group_exit+0x100/0x100 get_signal+0xd67/0xde0 arch_do_signal_or_restart+0x2a/0x2b0 exit_to_user_mode_prepare+0x94/0x100 syscall_exit_to_user_mode+0x20/0x40 do_syscall_64+0x52/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd Allocated by task 4404: kasan_set_track+0x3d/0x60 __kasan_kmalloc+0x85/0x90 psi_trigger_create+0x113/0x3e0 pressure_write+0x146/0x2e0 cgroup_file_write+0x11c/0x250 kernfs_fop_write_iter+0x186/0x220 vfs_write+0x3d8/0x5c0 ksys_write+0x90/0x110 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 4407: kasan_set_track+0x3d/0x60 kasan_save_free_info+0x27/0x40 ____kasan_slab_free+0x11d/0x170 slab_free_freelist_hook+0x87/0x150 __kmem_cache_free+0xcb/0x180 psi_trigger_destroy+0x2e8/0x310 cgroup_file_release+0x4f/0xb0 kernfs_drain_open_files+0x165/0x1f0 kernfs_drain+0x162/0x1a0 __kernfs_remove+0x1fb/0x310 kernfs_remove_by_name_ns+0x95/0xe0 cgroup_addrm_files+0x67f/0x700 cgroup_destroy_locked+0x283/0x3c0 cgroup_rmdir+0x29/0x100 kernfs_iop_rmdir+0xd1/0x140 vfs_rmdir+0xfe/0x240 do_rmdir+0x13d/0x280 __x64_sys_rmdir+0x2c/0x30 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd analysis: {"Container":"[Containers whose resources are managed with cgroups]","CVE_Reason":"[A thread polling on a pressure file inside a container can end up tying up large amounts of host CPU]","CVE_Consequence":"[This CVE can trigger a use-after-free, destabilizing or crashing the system; severity is high]"}
cve: ./data/2023/52xxx/CVE-2023-52848.json In the Linux kernel, the following vulnerability has been resolved: f2fs: fix to drop meta_inode's page cache in f2fs_put_super() syzbot reports a kernel bug as below: F2FS-fs (loop1): detect filesystem reference count leak during umount, type: 10, count: 1 kernel BUG at fs/f2fs/super.c:1639! CPU: 0 PID: 15451 Comm: syz-executor.1 Not tainted 6.5.0-syzkaller-09338-ge0152e7481c6 #0 RIP: 0010:f2fs_put_super+0xce1/0xed0 fs/f2fs/super.c:1639 Call Trace: generic_shutdown_super+0x161/0x3c0 fs/super.c:693 kill_block_super+0x3b/0x70 fs/super.c:1646 kill_f2fs_super+0x2b7/0x3d0 fs/f2fs/super.c:4879 deactivate_locked_super+0x9a/0x170 fs/super.c:481 deactivate_super+0xde/0x100 fs/super.c:514 cleanup_mnt+0x222/0x3d0 fs/namespace.c:1254 task_work_run+0x14d/0x240 kernel/task_work.c:179 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0x210/0x240 kernel/entry/common.c:204 __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline] syscall_exit_to_user_mode+0x1d/0x60 kernel/entry/common.c:296 do_syscall_64+0x44/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x63/0xcd In f2fs_put_super(), it tries to do a sanity check on the dirty and IO reference counts of f2fs; once there is any reference count leak, it triggers a panic. The root cause is: during f2fs_put_super(), if there is any IO error in f2fs_wait_on_all_pages(), we missed truncating meta_inode's page cache later, resulting in a panic; fix this case.
analysis: {"Container":"[Container images using the F2FS filesystem]","CVE_Reason":"[Running the container can make the host hit a reference count leak while unmounting an F2FS filesystem, triggering a kernel panic and burning CPU on error handling and logging]","CVE_Consequence":"[This CVE can crash the host kernel; severity is high]"}
cve: ./data/2023/52xxx/CVE-2023-52942.json In the Linux kernel, the following vulnerability has been resolved: cgroup/cpuset: Fix wrong check in update_parent_subparts_cpumask() It was found that the check to see if a partition could use up all the cpus from the parent cpuset in update_parent_subparts_cpumask() was incorrect. As a result, it is possible to leave the parent with no effective cpu left even if there are tasks in the parent cpuset. This can lead to a system panic as reported in [1]. Fix this problem by updating the check to fail enabling the partition if the parent's effective_cpus is a subset of the child's cpus_allowed. Also record the error code when an error happens in update_prstate() and add a test case where the parent partition and child have the same cpu list and the parent has a task. Enabling the partition in the child will fail in this case. [1] https://www.spinics.net/lists/cgroups/msg36254.html analysis: {"Container":"[Any container on an affected Linux kernel version]","CVE_Reason":"[Running the container can lead to incorrect host CPU allocation, leaving some tasks without any effective CPU]","CVE_Consequence":"[This CVE can cause a host panic or make the system unavailable; severity is high]"}
cve: ./data/2024/0xxx/CVE-2024-0443.json A flaw was found in the blkgs destruction path in block/blk-cgroup.c in the Linux kernel, leading to a cgroup blkio memory leakage problem. When a cgroup is being destroyed, cgroup_rstat_flush() is only called at css_release_work_fn(), which is called when the blkcg reference count reaches 0. This circular dependency will prevent blkcg and some blkgs from being freed after they are made offline. This issue may allow an attacker with local access to cause system instability, such as an out of memory error. analysis: {"Container":"[Any container image on an affected Linux kernel version]","CVE_Reason":"[Running the container leaks host memory: offlined blkcg and blkg instances are never properly freed]","CVE_Consequence":"[This CVE can exhaust system memory, destabilizing or crashing the host; severity is high]"}
cve: ./data/2024/10xxx/CVE-2024-10846.json The compose-go library component in versions v2.10-v2.4.0 allows an authorized user who sends malicious YAML payloads to cause compose-go to consume excessive amounts of memory and CPU cycles while parsing YAML, as used by Docker Compose from versions v2.27.0 to v2.29.7 inclusive. analysis: {"Container":"Docker Compose","CVE_Reason":"Running the container ties up large amounts of host CPU and memory","CVE_Consequence":"This CVE lets a malicious YAML file exhaust the host's memory and CPU during parsing; severity is high"}
cve: ./data/2024/23xxx/CVE-2024-23824.json mailcow is a dockerized email package, with multiple containers linked in one bridged network. The application is vulnerable to a pixel flood attack: once the payload has been successfully uploaded as the logo, the application slows down and the admin page stops responding. This was tested on versions 2023-12a and prior and patched in version 2024-01. analysis: {"Container":"mailcow","CVE_Reason":"The application in the container is vulnerable to a pixel flood attack that ties up large amounts of CPU","CVE_Consequence":"This CVE can exhaust host CPU, leaving the application slow or unresponsive; severity is high"}
cve: ./data/2024/31xxx/CVE-2024-31994.json Mealie is a self hosted recipe manager and meal planner. Prior to 1.4.0, an attacker can point the image request to an arbitrarily large file. Mealie will attempt to retrieve this file in whole. If it can be retrieved, it may be stored on the file system in whole (leading to possible disk consumption), however the more likely scenario given resource limitations is that the container will OOM during file retrieval if the target file size is greater than the allocated memory of the container.
At best this can be used to force the container to infinitely restart due to OOM (if so configured in `docker-compose.yml`), or at worst this can be used to force the Mealie container to crash and remain offline. In the event that the file can be retrieved, the lack of rate limiting on this endpoint also permits an attacker to generate ongoing requests to any target of their choice, potentially contributing to an external-facing DoS attack. This vulnerability is fixed in 1.4.0. analysis: {"Container":"Mealie","CVE_Reason":"Running the container can tie up large amounts of host memory and may also consume disk space","CVE_Consequence":"This CVE can make the container crash or restart indefinitely from OOM; severity is high"}
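Mealie itself is Python, but the mitigation shape is language-agnostic: stream the response and enforce a byte cap instead of buffering whole files. A hedged libcurl sketch of that shape (`fetch_image`, `on_data`, and the `MAX_IMAGE_BYTES` cap are all illustrative, not Mealie's code):

```c
#include <curl/curl.h>
#include <stddef.h>

#define MAX_IMAGE_BYTES (10 * 1024 * 1024)  /* illustrative cap */

struct sink { size_t total; };

static size_t on_data(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    struct sink *s = userdata;
    (void)ptr;
    s->total += size * nmemb;
    if (s->total > MAX_IMAGE_BYTES)
        return 0;               /* short write: libcurl aborts the transfer */
    /* ... persist at most MAX_IMAGE_BYTES to disk here ... */
    return size * nmemb;
}

int fetch_image(const char *url)
{
    struct sink s = {0};
    CURL *h = curl_easy_init();
    CURLcode rc;

    if (!h)
        return -1;
    curl_easy_setopt(h, CURLOPT_URL, url);
    curl_easy_setopt(h, CURLOPT_WRITEFUNCTION, on_data);
    curl_easy_setopt(h, CURLOPT_WRITEDATA, &s);
    rc = curl_easy_perform(h);  /* CURLE_WRITE_ERROR once the cap trips */
    curl_easy_cleanup(h);
    return rc == CURLE_OK ? 0 : -1;
}
```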
cve: ./data/2024/35xxx/CVE-2024-35894.json In the Linux kernel, the following vulnerability has been resolved: mptcp: prevent BPF accessing lowat from a subflow socket. Alexei reported the following splat: WARNING: CPU: 32 PID: 3276 at net/mptcp/subflow.c:1430 subflow_data_ready+0x147/0x1c0 Modules linked in: dummy bpf_testmod(O) [last unloaded: bpf_test_no_cfi(O)] CPU: 32 PID: 3276 Comm: test_progs Tainted: GO 6.8.0-12873-g2c43c33bfd23 Call Trace: mptcp_set_rcvlowat+0x79/0x1d0 sk_setsockopt+0x6c0/0x1540 __bpf_setsockopt+0x6f/0x90 bpf_sock_ops_setsockopt+0x3c/0x90 bpf_prog_509ce5db2c7f9981_bpf_test_sockopt_int+0xb4/0x11b bpf_prog_dce07e362d941d2b_bpf_test_socket_sockopt+0x12b/0x132 bpf_prog_348c9b5faaf10092_skops_sockopt+0x954/0xe86 __cgroup_bpf_run_filter_sock_ops+0xbc/0x250 tcp_connect+0x879/0x1160 tcp_v6_connect+0x50c/0x870 mptcp_connect+0x129/0x280 __inet_stream_connect+0xce/0x370 inet_stream_connect+0x36/0x50 bpf_trampoline_6442491565+0x49/0xef inet_stream_connect+0x5/0x50 __sys_connect+0x63/0x90 __x64_sys_connect+0x14/0x20 The root cause of the issue is that bpf allows accessing mptcp-level proto_ops from a tcp subflow scope. Fix the issue by detecting the problematic call and preventing any action. analysis: {"Container":"[Container images using the Linux kernel's mptcp feature]","CVE_Reason":"[A BPF program in the container can reach mptcp-level proto_ops from a TCP subflow socket, tying up large amounts of CPU]","CVE_Consequence":"[This CVE can exhaust host CPU; severity is high]"}
cve: ./data/2024/35xxx/CVE-2024-35896.json In the Linux kernel, the following vulnerability has been resolved: netfilter: validate user input for expected length I got multiple syzbot reports showing old bugs exposed by BPF after commit 20f2505fb436 ("bpf: Try to avoid kzalloc in cgroup/{s,g}etsockopt"). The setsockopt() @optlen argument should be taken into account before copying data. BUG: KASAN: slab-out-of-bounds in copy_from_sockptr_offset include/linux/sockptr.h:49 [inline] BUG: KASAN: slab-out-of-bounds in copy_from_sockptr include/linux/sockptr.h:55 [inline] BUG: KASAN: slab-out-of-bounds in do_replace net/ipv4/netfilter/ip_tables.c:1111 [inline] BUG: KASAN: slab-out-of-bounds in do_ipt_set_ctl+0x902/0x3dd0 net/ipv4/netfilter/ip_tables.c:1627 Read of size 96 at addr ffff88802cd73da0 by task syz-executor.4/7238 CPU: 1 PID: 7238 Comm: syz-executor.4 Not tainted 6.9.0-rc2-next-20240403-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 Call Trace: __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 kasan_check_range+0x282/0x290 mm/kasan/generic.c:189 __asan_memcpy+0x29/0x70 mm/kasan/shadow.c:105 copy_from_sockptr_offset include/linux/sockptr.h:49 [inline] copy_from_sockptr include/linux/sockptr.h:55 [inline] do_replace net/ipv4/netfilter/ip_tables.c:1111 [inline] do_ipt_set_ctl+0x902/0x3dd0 net/ipv4/netfilter/ip_tables.c:1627 nf_setsockopt+0x295/0x2c0 net/netfilter/nf_sockopt.c:101 do_sock_setsockopt+0x3af/0x720 net/socket.c:2311 __sys_setsockopt+0x1ae/0x250 net/socket.c:2334 __do_sys_setsockopt net/socket.c:2343 [inline] __se_sys_setsockopt net/socket.c:2340 [inline] __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x72/0x7a RIP: 0033:0x7fd22067dde9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fd21f9ff0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 00007fd2207abf80 RCX: 00007fd22067dde9 RDX: 0000000000000040 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00007fd2206ca47a R08: 0000000000000001 R09: 0000000000000000 R10: 0000000020000880 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007fd2207abf80 R15: 00007ffd2d0170d8 Allocated by task 7238: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:370 [inline] __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387 kasan_kmalloc include/linux/kasan.h:211 [inline] __do_kmalloc_node mm/slub.c:4069 [inline] __kmalloc_noprof+0x200/0x410 mm/slub.c:4082 kmalloc_noprof include/linux/slab.h:664 [inline] __cgroup_bpf_run_filter_setsockopt+0xd47/0x1050 kernel/bpf/cgroup.c:1869 do_sock_setsockopt+0x6b4/0x720 net/socket.c:2293 __sys_setsockopt+0x1ae/0x250 net/socket.c:2334 __do_sys_setsockopt net/socket.c:2343 [inline] __se_sys_setsockopt net/socket.c:2340 [inline] __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x72/0x7a The buggy address belongs to the object at ffff88802cd73da0 which belongs to the cache kmalloc-8 of size 8 The buggy address is located 0 bytes inside of allocated 1-byte region [ffff88802cd73da0, ffff88802cd73da1) The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88802cd73020 pfn:0x2cd73 flags: 0xfff80000000000(node=0|zone=1|lastcpupid=0xfff) page_type: 0xffffefff(slab) raw: 00fff80000000000 ffff888015041280 dead000000000100 dead000000000122 raw: ffff88802cd73020 000000008080007f 00000001ffffefff 00
---truncated--- analysis: {"Container":"[Containers using netfilter functionality]","CVE_Reason":"[Insufficient validation of user input lets an attacker burn large amounts of CPU through crafted setsockopt calls]","CVE_Consequence":"[This flaw lets an attacker trigger CPU-intensive operations and exhaust host CPU; severity is high]"}
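The same one-line discipline fixes this whole class of bug (it recurs in the xsk entry below): validate @optlen before any copy_from_sockptr(). A kernel-style sketch of the pattern (simplified; `my_setsockopt` and `struct my_request` are hypothetical):

```c
/* Kernel-style sketch; assumes <linux/sockptr.h>. The BPF cgroup
 * get/setsockopt path can hand the handler a kernel buffer that is
 * exactly @optlen bytes, so reading sizeof(req) bytes without this
 * check is an out-of-bounds read. */
struct my_request {
    u32 flags;
    u32 value;
};

static int my_setsockopt(sockptr_t optval, unsigned int optlen)
{
    struct my_request req;

    if (optlen < sizeof(req))   /* the check these fixes add */
        return -EINVAL;
    if (copy_from_sockptr(&req, optval, sizeof(req)))
        return -EFAULT;
    /* ... act on req ... */
    return 0;
}
```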
cve: ./data/2024/35xxx/CVE-2024-35974.json In the Linux kernel, the following vulnerability has been resolved: block: fix q->blkg_list corruption during disk rebind Multiple gendisk instances can be allocated/added for a single request queue in case of disk rebind. blkg may still stay in q->blkg_list when calling blkcg_init_disk() for the rebind, and q->blkg_list then becomes corrupted. Fix the list corruption issue by: - adding blkg_init_queue() to initialize only q->blkg_list and q->blkcg_mutex - moving the call to blkg_init_queue() into blk_alloc_queue() The list corruption has been possible since commit f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()"), which delays removing the blkg from q->blkg_list until blkg_free_workfn(). analysis: {"Container":"[Container images on an affected Linux kernel]","CVE_Reason":"[Running the container can corrupt the host's disk scheduling queue management, tying up CPU handling the broken queue state]","CVE_Consequence":"[This CVE can degrade the host's disk I/O subsystem; in severe cases it becomes a host-wide performance bottleneck or exhausts resources. Severity: high]"}
cve: ./data/2024/35xxx/CVE-2024-35976.json In the Linux kernel, the following vulnerability has been resolved: xsk: validate user input for XDP_{UMEM|COMPLETION}_FILL_RING syzbot reported an illegal copy in xsk_setsockopt() [1]. Make sure to validate the setsockopt() @optlen parameter. [1] BUG: KASAN: slab-out-of-bounds in copy_from_sockptr_offset include/linux/sockptr.h:49 [inline] BUG: KASAN: slab-out-of-bounds in copy_from_sockptr include/linux/sockptr.h:55 [inline] BUG: KASAN: slab-out-of-bounds in xsk_setsockopt+0x909/0xa40 net/xdp/xsk.c:1420 Read of size 4 at addr ffff888028c6cde3 by task syz-executor.0/7549 CPU: 0 PID: 7549 Comm: syz-executor.0 Not tainted 6.8.0-syzkaller-08951-gfe46a7dd189e #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 Call Trace: __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 copy_from_sockptr_offset include/linux/sockptr.h:49 [inline] copy_from_sockptr include/linux/sockptr.h:55 [inline] xsk_setsockopt+0x909/0xa40 net/xdp/xsk.c:1420 do_sock_setsockopt+0x3af/0x720 net/socket.c:2311 __sys_setsockopt+0x1ae/0x250 net/socket.c:2334 __do_sys_setsockopt net/socket.c:2343 [inline] __se_sys_setsockopt net/socket.c:2340 [inline] __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x6d/0x75 RIP: 0033:0x7fb40587de69 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fb40665a0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 00007fb4059abf80 RCX: 00007fb40587de69 RDX: 0000000000000005 RSI: 000000000000011b RDI: 0000000000000006 RBP: 00007fb4058ca47a R08: 0000000000000002 R09: 0000000000000000 R10: 0000000020001980 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007fb4059abf80 R15: 00007fff57ee4d08 Allocated by task 7549: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:370 [inline] __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387 kasan_kmalloc include/linux/kasan.h:211 [inline] __do_kmalloc_node mm/slub.c:3966 [inline] __kmalloc+0x233/0x4a0 mm/slub.c:3979 kmalloc include/linux/slab.h:632 [inline] __cgroup_bpf_run_filter_setsockopt+0xd2f/0x1040 kernel/bpf/cgroup.c:1869 do_sock_setsockopt+0x6b4/0x720 net/socket.c:2293 __sys_setsockopt+0x1ae/0x250 net/socket.c:2334 __do_sys_setsockopt net/socket.c:2343 [inline] __se_sys_setsockopt net/socket.c:2340 [inline] __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x6d/0x75 The buggy address belongs to the object at ffff888028c6cde0 which belongs to the cache kmalloc-8 of size 8 The buggy address is located 1 bytes to the right of allocated 2-byte region [ffff888028c6cde0, ffff888028c6cde2) The buggy address belongs to the physical page: page:ffffea0000a31b00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888028c6c9c0 pfn:0x28c6c anon flags: 0xfff00000000800(slab|node=0|zone=1|lastcpupid=0x7ff) page_type: 0xffffffff() raw: 00fff00000000800 ffff888014c41280 0000000000000000 dead000000000001 raw: ffff888028c6c9c0 0000000080800057 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112cc0(GFP_USER|__GFP_NOWARN|__GFP_NORETRY), pid 6648, tgid 6644 (syz-executor.0), ts 133906047828, free_ts 133859922223 set_page_owner include/linux/page_owner.h:31 [inline] post_alloc_hook+0x1ea/0x210 mm/page_alloc.c:1533 prep_new_page mm/page_alloc.c: ---truncated--- analysis: {"Container":"[Any container using the Linux kernel's XDP functionality]","CVE_Reason":"[Missing validation of user input could let an attacker trigger excessive allocations or tight loops, tying up CPU or exhausting memory]","CVE_Consequence":"[This CVE can exhaust host CPU and memory; severity is high]"}
cve: ./data/2024/38xxx/CVE-2024-38384.json In the Linux kernel, the following vulnerability has been resolved: blk-cgroup: fix list corruption from reorder of WRITE ->lqueued __blkcg_rstat_flush() can be run anytime, especially when blk_cgroup_bio_start is being executed. If the WRITE of `->lqueued` is re-ordered with the READ of 'bisc->lnode.next' in the loop of __blkcg_rstat_flush(), `next_bisc` can be assigned with one stat instance being added in blk_cgroup_bio_start(), and then the local list in __blkcg_rstat_flush() could be corrupted. Fix the issue by adding one barrier. analysis: {"Container":"[Any container on an affected Linux kernel]","CVE_Reason":"[A race in the blk-cgroup stat-flush logic can tie up host CPU handling the corrupted list traversal and recovery]","CVE_Consequence":"[This CVE can abnormally raise host CPU usage; severity is medium-high, depending on how heavily blk-cgroup is used]"}
cve: ./data/2024/38xxx/CVE-2024-38564.json In the Linux kernel, the following vulnerability has been resolved: bpf: Add BPF_PROG_TYPE_CGROUP_SKB attach type enforcement in BPF_LINK_CREATE bpf_prog_attach uses attach_type_to_prog_type to enforce the proper attach type for BPF_PROG_TYPE_CGROUP_SKB. link_create uses bpf_prog_get and relies on bpf_prog_attach_check_attach_type to properly verify the prog_type <> attach_type association. Add the missing attach_type enforcement for the link_create case. Otherwise, it's currently possible to attach cgroup_skb prog types to other cgroup hooks.
analysis: {"Container":"[Any container image using BPF_PROG_TYPE_CGROUP_SKB]","CVE_Reason":"[A BPF program from a container can be wrongly attached to other cgroup hooks, tying up large amounts of CPU]","CVE_Consequence":"[This CVE can let a malicious or misconfigured container consume excessive host CPU; severity is high]"}
cve: ./data/2024/38xxx/CVE-2024-38663.json In the Linux kernel, the following vulnerability has been resolved: blk-cgroup: fix list corruption from resetting io stat Since commit 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()"), each iostat instance is added to a blkcg percpu list, so blkcg_reset_stats() can't reset the stat instance by memset(); otherwise the llist may be corrupted. Fix the issue by only resetting the counter part. analysis: {"Container":"[Any container on an affected Linux kernel version]","CVE_Reason":"[Running the container can corrupt the host's blk-cgroup I/O statistics reset path, indirectly hurting CPU scheduling efficiency]","CVE_Consequence":"[This CVE can corrupt the host's block-device statistics list, degrading I/O performance and resource allocation; severity is medium]"}
cve: ./data/2024/3xxx/CVE-2024-3056.json A flaw was found in Podman. This issue may allow an attacker to create a specially crafted container that, when configured to share the same IPC with at least one other container, can create a large number of IPC resources in /dev/shm. The malicious container will continue to exhaust resources until it is out-of-memory (OOM) killed. While the malicious container's cgroup will be removed, the IPC resources it created are not. Those resources are tied to the IPC namespace, which will not be removed until all containers using it are stopped, and one non-malicious container is holding the namespace open. The malicious container is restarted, either automatically or by attacker control, repeating the process and increasing the amount of memory consumed. With a container configured to restart always, such as `podman run --restart=always`, this can result in a memory-based denial of service of the system. analysis: ```json {"Container":"Podman","CVE_Reason":"Running the container ties up large amounts of host memory","CVE_Consequence":"A malicious container keeps creating IPC resources, driving memory consumption ever higher and eventually enabling a memory-based denial of service (DoS) against the host; severity is high"} ```
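The persistence primitive behind the Podman issue can be seen with a few lines of userspace C: POSIX shared-memory objects live in the IPC namespace's /dev/shm and survive the exit of the process (and container) that created them. A bounded demo sketch of that mechanism follows (illustrative only; run it in a disposable namespace, and link with -lrt on older glibc):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    char name[64];

    for (int i = 0; i < 1000; i++) {        /* bounded for the demo */
        snprintf(name, sizeof(name), "/demo-%d", i);
        int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
        if (fd < 0)
            break;
        /* Size the object and touch its last page so tmpfs memory is
         * actually charged, then "leak" it: no shm_unlink(). */
        ftruncate(fd, 1 << 20);
        pwrite(fd, "", 1, (1 << 20) - 1);
        close(fd);
    }
    /* The objects now outlive this process; in the CVE they also
     * outlive the malicious container's cgroup, so every restart
     * stacks another batch onto the shared IPC namespace. */
    return 0;
}
```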
cve: ./data/2024/40xxx/CVE-2024-40949.json In the Linux kernel, the following vulnerability has been resolved: mm: shmem: fix getting incorrect lruvec when replacing a shmem folio When testing shmem swapin, I encountered the warning below on my machine. The reason is that replacing an old shmem folio with a new one causes mem_cgroup_migrate() to clear the old folio's memcg data. As a result, the old folio cannot get the correct memcg's lruvec needed to remove itself from the LRU list when it is being freed. This could lead to possibly serious problems, such as LRU list crashes due to holding the wrong LRU lock, and incorrect LRU statistics. To fix this issue, we can fall back to using mem_cgroup_replace_folio() to replace the old shmem folio. [ 5241.100311] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x5d9960 [ 5241.100317] head: order:4 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [ 5241.100319] flags: 0x17fffe0000040068(uptodate|lru|head|swapbacked|node=0|zone=2|lastcpupid=0x3ffff) [ 5241.100323] raw: 17fffe0000040068 fffffdffd6687948 fffffdffd69ae008 0000000000000000 [ 5241.100325] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 [ 5241.100326] head: 17fffe0000040068 fffffdffd6687948 fffffdffd69ae008 0000000000000000 [ 5241.100327] head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 [ 5241.100328] head: 17fffe0000000204 fffffdffd6665801 ffffffffffffffff 0000000000000000 [ 5241.100329] head: 0000000a00000010 0000000000000000 00000000ffffffff 0000000000000000 [ 5241.100330] page dumped because: VM_WARN_ON_ONCE_FOLIO(!memcg && !mem_cgroup_disabled()) [ 5241.100338] ------------[ cut here ]------------ [ 5241.100339] WARNING: CPU: 19 PID: 78402 at include/linux/memcontrol.h:775 folio_lruvec_lock_irqsave+0x140/0x150 [...] [ 5241.100374] pc : folio_lruvec_lock_irqsave+0x140/0x150 [ 5241.100375] lr : folio_lruvec_lock_irqsave+0x138/0x150 [ 5241.100376] sp : ffff80008b38b930 [...] [ 5241.100398] Call trace: [ 5241.100399] folio_lruvec_lock_irqsave+0x140/0x150 [ 5241.100401] __page_cache_release+0x90/0x300 [ 5241.100404] __folio_put+0x50/0x108 [ 5241.100406] shmem_replace_folio+0x1b4/0x240 [ 5241.100409] shmem_swapin_folio+0x314/0x528 [ 5241.100411] shmem_get_folio_gfp+0x3b4/0x930 [ 5241.100412] shmem_fault+0x74/0x160 [ 5241.100414] __do_fault+0x40/0x218 [ 5241.100417] do_shared_fault+0x34/0x1b0 [ 5241.100419] do_fault+0x40/0x168 [ 5241.100420] handle_pte_fault+0x80/0x228 [ 5241.100422] __handle_mm_fault+0x1c4/0x440 [ 5241.100424] handle_mm_fault+0x60/0x1f0 [ 5241.100426] do_page_fault+0x120/0x488 [ 5241.100429] do_translation_fault+0x4c/0x68 [ 5241.100431] do_mem_abort+0x48/0xa0 [ 5241.100434] el0_da+0x38/0xc0 [ 5241.100436] el0t_64_sync_handler+0x68/0xc0 [ 5241.100437] el0t_64_sync+0x14c/0x150 [ 5241.100439] ---[ end trace 0000000000000000 ]--- [baolin.wang@linux.alibaba.com: remove less helpful comments, per Matthew] analysis: {"Container":"[Any container image on an affected Linux kernel version]","CVE_Reason":"[Memory-management operations inside a container can crash the host's LRU lists, throwing memory resource management into disarray]","CVE_Consequence":"[This CVE can misallocate host memory; in severe cases it exhausts host memory. Severity: high]"}
cve: ./data/2024/42xxx/CVE-2024-42297.json In the Linux kernel, the following vulnerability has been resolved: f2fs: fix to don't dirty inode for readonly filesystem syzbot reports an f2fs bug as below: kernel BUG at fs/f2fs/inode.c:933!
RIP: 0010:f2fs_evict_inode+0x1576/0x1590 fs/f2fs/inode.c:933 Call Trace: evict+0x2a4/0x620 fs/inode.c:664 dispose_list fs/inode.c:697 [inline] evict_inodes+0x5f8/0x690 fs/inode.c:747 generic_shutdown_super+0x9d/0x2c0 fs/super.c:675 kill_block_super+0x44/0x90 fs/super.c:1667 kill_f2fs_super+0x303/0x3b0 fs/f2fs/super.c:4894 deactivate_locked_super+0xc1/0x130 fs/super.c:484 cleanup_mnt+0x426/0x4c0 fs/namespace.c:1256 task_work_run+0x24a/0x300 kernel/task_work.c:180 ptrace_notify+0x2cd/0x380 kernel/signal.c:2399 ptrace_report_syscall include/linux/ptrace.h:411 [inline] ptrace_report_syscall_exit include/linux/ptrace.h:473 [inline] syscall_exit_work kernel/entry/common.c:251 [inline] syscall_exit_to_user_mode_prepare kernel/entry/common.c:278 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:283 [inline] syscall_exit_to_user_mode+0x15c/0x280 kernel/entry/common.c:296 do_syscall_64+0x50/0x110 arch/x86/entry/common.c:88 entry_SYSCALL_64_after_hwframe+0x63/0x6b
The root cause is:
- do_sys_open
- f2fs_lookup
- __f2fs_find_entry
- f2fs_i_depth_write
- f2fs_mark_inode_dirty_sync
- f2fs_dirty_inode
- set_inode_flag(inode, FI_DIRTY_INODE)
- umount
- kill_f2fs_super
- kill_block_super
- generic_shutdown_super
- sync_filesystem : sb is readonly, skip sync_filesystem()
- evict_inodes
- iput
- f2fs_evict_inode
- f2fs_bug_on(sbi, is_inode_flag_set(inode, FI_DIRTY_INODE)) : trigger kernel panic
When we try to repair i_current_depth in a readonly filesystem, let's skip the dirty inode to avoid a panic in the later f2fs_evict_inode().
analysis: {"Container":"[Container images using the F2FS filesystem]","CVE_Reason":"[When a container runs on a read-only F2FS filesystem, the kernel bug can leave an inode flagged dirty, which trips f2fs_bug_on() when the inode is evicted at unmount]","CVE_Consequence":"[This CVE can trigger a kernel panic or system instability; severity: high]"}
cve: ./data/2024/43xxx/CVE-2024-43853.json In the Linux kernel, the following vulnerability has been resolved: cgroup/cpuset: Prevent UAF in proc_cpuset_show() A UAF can happen when /proc/cpuset is read as reported in [1]. This can be reproduced by the following methods:
1. add an mdelay(1000) before acquiring the cgroup_lock in the cgroup_path_ns function.
2. $ cat /proc//cpuset repeatedly.
3. $ mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset/ and $ umount /sys/fs/cgroup/cpuset/ repeatedly.
The race that causes this bug can be shown as below:
(umount) | (cat /proc//cpuset)
css_release | proc_cpuset_show
css_release_work_fn | css = task_get_css(tsk, cpuset_cgrp_id);
css_free_rwork_fn | cgroup_path_ns(css->cgroup, ...);
cgroup_destroy_root | mutex_lock(&cgroup_mutex);
rebind_subsystems |
cgroup_free_root |
 | // cgrp was freed, UAF
 | cgroup_path_ns_locked(cgrp,..);
When the cpuset is initialized, the root node top_cpuset.css.cgrp will point to &cgrp_dfl_root.cgrp. In cgroup v1, the mount operation will allocate cgroup_root, and top_cpuset.css.cgrp will point to the allocated &cgroup_root.cgrp. When the umount operation is executed, top_cpuset.css.cgrp will be rebound to &cgrp_dfl_root.cgrp. The problem is that when rebinding to cgrp_dfl_root, there are cases where the cgroup_root allocated by setting up the root for cgroup v1 is cached. This could lead to a Use-After-Free (UAF) if it is subsequently freed. The descendant cgroups of cgroup v1 can only be freed after the css is released. However, the css of the root will never be released, yet the cgroup_root should be freed when it is unmounted. This means that obtaining a reference to the css of the root does not guarantee that css.cgrp->root will not be freed. Fix this problem by using rcu_read_lock in proc_cpuset_show().
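The essence of that fix: rather than taking a reference that cannot pin the root object, the reader enters an RCU read-side critical section during which a deferred (kfree_rcu-style) free cannot complete. The following is a single-threaded toy model of that ordering only; `deferred`, `in_read_section`, and `maybe_run_grace_period` are invented for illustration, and real RCU is concurrent and far more subtle.

```c
/* Toy model, not kernel RCU: a queued free may only run once no
 * reader is inside its critical section. */
#include <stdio.h>
#include <stdlib.h>

struct root { const char *name; };

static struct root *deferred;   /* models the kfree_rcu queue */
static int in_read_section;     /* models being inside rcu_read_lock() */

/* queue the object for freeing instead of freeing it immediately */
static void kfree_deferred(struct root *r) { deferred = r; }

/* a grace period can only complete outside read-side critical sections */
static void maybe_run_grace_period(void)
{
    if (!in_read_section && deferred) {
        free(deferred);
        deferred = NULL;
    }
}

int main(void)
{
    struct root *r = malloc(sizeof(*r));
    if (!r)
        return 1;
    r->name = "cgroup_root";

    kfree_deferred(r);          /* the umount path queues the free */

    in_read_section = 1;        /* rcu_read_lock() in proc_cpuset_show() */
    maybe_run_grace_period();   /* cannot free: a reader is still active */
    printf("safe read: %s\n", r->name);
    in_read_section = 0;        /* rcu_read_unlock() */

    maybe_run_grace_period();   /* now the object is actually freed */
    return 0;
}
```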
As cgroup_root is kfree_rcu after commit d23b5c577715 ("cgroup: Make operations on the cgroup root_list RCU safe"), css->cgroup won't be freed during the critical section. To call cgroup_path_ns_locked, css_set_lock is needed, so it is safe to replace task_get_css with task_css. [1] https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd
analysis: {"Container":"[Linux containers using the cpuset control group]","CVE_Reason":"[A race between unmounting a cgroup v1 cpuset hierarchy and reading /proc//cpuset triggers a use-after-free, destabilizing the host and wasting CPU on recovery]","CVE_Consequence":"[This CVE can break container management on the host; in severe cases the kernel crashes and the host must be rebooted. Severity: high]"}
cve: ./data/2024/43xxx/CVE-2024-43892.json In the Linux kernel, the following vulnerability has been resolved: memcg: protect concurrent access to mem_cgroup_idr Commit 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs") decoupled the memcg IDs from the CSS ID space to fix the cgroup creation failures. It introduced IDR to maintain the memcg ID space. The IDR depends on external synchronization mechanisms for modifications. For the mem_cgroup_idr, the idr_alloc() and idr_replace() happen within css callback and thus are protected through cgroup_mutex from concurrent modifications. However idr_remove() for mem_cgroup_idr was not protected against concurrency and can be run concurrently for different memcgs when they hit their refcnt to zero. Fix that. We have been seeing list_lru based kernel crashes at a low frequency in our fleet for a long time. These crashes were in different parts of list_lru code including list_lru_add(), list_lru_del() and reparenting code. Upon further inspection, it looked like for a given object (dentry and inode), the super_block's list_lru didn't have list_lru_one for the memcg of that object. The initial suspicions were either the object is not allocated through kmem_cache_alloc_lru() or somehow memcg_list_lru_alloc() failed to allocate list_lru_one() for a memcg but returned success. No evidence was found for these cases. Looking more deeply, we started seeing situations where a valid memcg's id is not present in mem_cgroup_idr and in some cases multiple valid memcgs have the same id and mem_cgroup_idr is pointing to one of them. So, the most reasonable explanation is that these situations can happen due to a race between multiple idr_remove() calls or a race between idr_alloc()/idr_replace() and idr_remove(). These races cause multiple memcgs to acquire the same ID, and then offlining one of them cleans up the list_lrus on the system for all of them. Later accesses from the other memcgs to the list_lru cause crashes due to the missing list_lru_one.
analysis: {"Container":"[Any container image running on an affected Linux kernel]","CVE_Reason":"[Unsynchronized concurrent idr_remove() calls on mem_cgroup_idr let multiple memcgs acquire the same ID, corrupting memory-cgroup bookkeeping and causing abnormal resource consumption]","CVE_Consequence":"[This CVE can race container memory management on the host, leading to kernel crashes (e.g. in list_lru) or erroneous resource cleanup. Severity: high, since it can affect the stability of the entire host]"}
cve: ./data/2024/49xxx/CVE-2024-49974.json In the Linux kernel, the following vulnerability has been resolved: NFSD: Limit the number of concurrent async COPY operations Nothing appears to limit the number of concurrent async COPY operations that clients can start. In addition, AFAICT each async COPY can copy an unlimited number of 4MB chunks, so can run for a long time. Thus IMO async COPY can become a DoS vector. Add a restriction mechanism that bounds the number of concurrent background COPY operations. Start simple and try to be fair -- this patch implements a per-namespace limit. An async COPY request that occurs while this limit is exceeded gets NFS4ERR_DELAY.
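A hedged sketch of such a bound, written as plain userspace C rather than nfsd code: a per-namespace counter is incremented when a background copy starts, and requests past the cap get a "try again later" status (NFS4ERR_DELAY in the real patch). `MAX_ASYNC_COPIES`, `E_DELAY`, and the function names are illustrative values invented here, not the kernel's.

```c
/* Toy model of a per-namespace cap on concurrent background copies. */
#include <stdatomic.h>
#include <stdio.h>

#define MAX_ASYNC_COPIES 4    /* illustrative limit, not the kernel's */
#define E_DELAY (-1)          /* stands in for NFS4ERR_DELAY */

static atomic_int nr_async_copies;   /* one instance per namespace */

static int start_async_copy(void)
{
    if (atomic_fetch_add(&nr_async_copies, 1) >= MAX_ASYNC_COPIES) {
        atomic_fetch_sub(&nr_async_copies, 1);  /* undo the optimistic bump */
        return E_DELAY;       /* caller must retry later or copy synchronously */
    }
    return 0;
}

static void end_async_copy(void)
{
    atomic_fetch_sub(&nr_async_copies, 1);
}

int main(void)
{
    for (int i = 0; i < 6; i++)
        printf("copy %d -> %s\n", i,
               start_async_copy() == E_DELAY ? "NFS4ERR_DELAY" : "started");
    end_async_copy();         /* one background copy finishes */
    printf("retry -> %s\n",
           start_async_copy() == E_DELAY ? "NFS4ERR_DELAY" : "started");
    return 0;
}
```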
The requesting client can choose to send the request again after a delay or fall back to a traditional read/write style copy. If there is a need to make the mechanism more sophisticated, we can visit that in future patches.
analysis: {"Container":"[Container images using the Linux kernel NFS server]","CVE_Reason":"[An NFS client in the container can launch large numbers of concurrent async COPY operations, over-consuming host CPU and memory]","CVE_Consequence":"[This CVE can enable a denial-of-service attack; severity: high]"}
cve: ./data/2024/50xxx/CVE-2024-50130.json In the Linux kernel, the following vulnerability has been resolved: netfilter: bpf: must hold reference on net namespace BUG: KASAN: slab-use-after-free in __nf_unregister_net_hook+0x640/0x6b0 Read of size 8 at addr ffff8880106fe400 by task repro/72= bpf_nf_link_release+0xda/0x1e0 bpf_link_free+0x139/0x2d0 bpf_link_release+0x68/0x80 __fput+0x414/0xb60 Eric says: It seems that bpf was able to defer the __nf_unregister_net_hook() after exit()/close() time. Perhaps a netns reference is missing, because the netns has been dismantled/freed already. bpf_nf_link_attach() does : link->net = net; But I do not see a reference being taken on net. Add such a reference and release it after hook unreg. Note that I was unable to get the syzbot reproducer to work, so I do not know if this resolves this splat.
analysis: {"Container":"[Any container image running on an affected Linux kernel version]","CVE_Reason":"[Because the netfilter BPF link does not hold a reference on the network namespace, a use-after-free can occur, destabilizing or crashing the host kernel and indirectly causing abnormal resource consumption]","CVE_Consequence":"[This CVE can corrupt kernel memory, leading to system crashes or degraded performance; severity: high]"}
cve: ./data/2024/53xxx/CVE-2024-53168.json In the Linux kernel, the following vulnerability has been resolved: sunrpc: fix one UAF issue caused by sunrpc kernel tcp socket BUG: KASAN: slab-use-after-free in tcp_write_timer_handler+0x156/0x3e0 Read of size 1 at addr ffff888111f322cd by task swapper/0/0 CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.0-rc4-dirty #7 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 Call Trace: dump_stack_lvl+0x68/0xa0 print_address_description.constprop.0+0x2c/0x3d0 print_report+0xb4/0x270 kasan_report+0xbd/0xf0 tcp_write_timer_handler+0x156/0x3e0 tcp_write_timer+0x66/0x170 call_timer_fn+0xfb/0x1d0 __run_timers+0x3f8/0x480 run_timer_softirq+0x9b/0x100 handle_softirqs+0x153/0x390 __irq_exit_rcu+0x103/0x120 irq_exit_rcu+0xe/0x20 sysvec_apic_timer_interrupt+0x76/0x90 asm_sysvec_apic_timer_interrupt+0x1a/0x20 RIP: 0010:default_idle+0xf/0x20 Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 33 f8 25 00 fb f4 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 RSP: 0018:ffffffffa2007e28 EFLAGS: 00000242 RAX: 00000000000f3b31 RBX: 1ffffffff4400fc7 RCX: ffffffffa09c3196 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9f00590f RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed102360835d R10: ffff88811b041aeb R11: 0000000000000001 R12: 0000000000000000 R13: ffffffffa202d7c0 R14: 0000000000000000 R15: 00000000000147d0 default_idle_call+0x6b/0xa0 cpuidle_idle_call+0x1af/0x1f0 do_idle+0xbc/0x130 cpu_startup_entry+0x33/0x40 rest_init+0x11f/0x210 start_kernel+0x39a/0x420 x86_64_start_reservations+0x18/0x30 x86_64_start_kernel+0x97/0xa0 common_startup_64+0x13e/0x141 Allocated by task 595: kasan_save_stack+0x24/0x50 kasan_save_track+0x14/0x30 __kasan_slab_alloc+0x87/0x90 kmem_cache_alloc_noprof+0x12b/0x3f0 copy_net_ns+0x94/0x380 create_new_namespaces+0x24c/0x500 unshare_nsproxy_namespaces+0x75/0xf0 ksys_unshare+0x24e/0x4f0 __x64_sys_unshare+0x1f/0x30 do_syscall_64+0x70/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e Freed by task 100: kasan_save_stack+0x24/0x50
kasan_save_track+0x14/0x30 kasan_save_free_info+0x3b/0x60 __kasan_slab_free+0x54/0x70 kmem_cache_free+0x156/0x5d0 cleanup_net+0x5d3/0x670 process_one_work+0x776/0xa90 worker_thread+0x2e2/0x560 kthread+0x1a8/0x1f0 ret_from_fork+0x34/0x60 ret_from_fork_asm+0x1a/0x30
Reproduction script:
mkdir -p /mnt/nfsshare
mkdir -p /mnt/nfs/netns_1
mkfs.ext4 /dev/sdb
mount /dev/sdb /mnt/nfsshare
systemctl restart nfs-server
chmod 777 /mnt/nfsshare
exportfs -i -o rw,no_root_squash *:/mnt/nfsshare
ip netns add netns_1
ip link add name veth_1_peer type veth peer veth_1
ifconfig veth_1_peer 11.11.0.254 up
ip link set veth_1 netns netns_1
ip netns exec netns_1 ifconfig veth_1 11.11.0.1
ip netns exec netns_1 /root/iptables -A OUTPUT -d 11.11.0.254 -p tcp \
 --tcp-flags FIN FIN -j DROP
(note: In my environment, a DESTROY_CLIENTID operation is always sent immediately, breaking the nfs tcp connection.)
ip netns exec netns_1 timeout -s 9 300 mount -t nfs -o proto=tcp,vers=4.1 \
 11.11.0.254:/mnt/nfsshare /mnt/nfs/netns_1
ip netns del netns_1
The reason here is that the tcp socket in netns_1 (nfs side) has been shutdown and closed (done in xs_destroy), but the FIN message (with ack) is discarded, and the nfsd side keeps sending retransmission messages. As a result, when the tcp sock in netns_1 processes the received message, it sends the message (FIN message) in the sending queue, and the tcp timer is re-established. When the network namespace is deleted, the net structure accessed by tcp's timer handler function causes problems. To fix this problem, let's hold the netns refcnt for the tcp kernel socket as done in other modules. This is an ugly hack which can easily be backported to earlier kernels. A proper fix which cleans up the interfaces will follow, but may not be so easy to backport.
analysis: {"Container":"[Containers using NFS mounts]","CVE_Reason":"[NFS client operations inside the container can leave a kernel TCP timer referencing a freed network namespace, leaking host TCP resources and driving up CPU usage]","CVE_Consequence":"[This CVE causes invalid resource accesses when a network namespace is deleted, which can raise host CPU usage abnormally; severity: high]"}
cve: ./data/2024/53xxx/CVE-2024-53175.json In the Linux kernel, the following vulnerability has been resolved: ipc: fix memleak if msg_init_ns failed in create_ipc_ns Percpu memory allocation may fail during create_ipc_ns; however, this failure is not handled properly, since the ipc sysctls and mq sysctls are not released. Fix this by releasing these two resources on failure. Here is the kmemleak stack when percpu failed: unreferenced object 0xffff88819de2a600 (size 512): comm "shmem_2nstest", pid 120711, jiffies 4300542254 hex dump (first 32 bytes): 60 aa 9d 84 ff ff ff ff fc 18 48 b2 84 88 ff ff `.........H..... 04 00 00 00 a4 01 00 00 20 e4 56 81 ff ff ff ff ........ .V..... backtrace (crc be7cba35): [] __kmalloc_node_track_caller_noprof+0x333/0x420 [] kmemdup_noprof+0x26/0x50 [] setup_mq_sysctls+0x57/0x1d0 [] copy_ipcs+0x29c/0x3b0 [] create_new_namespaces+0x1d0/0x920 [] copy_namespaces+0x2e9/0x3e0 [] copy_process+0x29f3/0x7ff0 [] kernel_clone+0xc0/0x650 [] __do_sys_clone+0xa1/0xe0 [] do_syscall_64+0xbf/0x1c0 [] entry_SYSCALL_64_after_hwframe+0x4b/0x53
analysis: {"Container":"[Any container image running on an affected Linux kernel version]","CVE_Reason":"[A memory leak when IPC namespace creation fails can make container workloads consume ever more host memory]","CVE_Consequence":"[This CVE can gradually exhaust host memory; severity: high]"}
cve: ./data/2024/56xxx/CVE-2024-56644.json In the Linux kernel, the following vulnerability has been resolved: net/ipv6: release expired exception dst cached in socket Dst objects get leaked in ip6_negative_advice() when this function is executed for an expired IPv6 route located in the exception table.
There are several conditions that must be fulfilled for the leak to occur:
* an ICMPv6 packet indicating a change of the MTU for the path is received, resulting in an exception dst being created
* a TCP connection that uses the exception dst for routing packets must start timing out so that TCP begins retransmissions
* after the exception dst expires, the FIB6 garbage collector must not run before TCP executes ip6_negative_advice() for the expired exception dst
When TCP executes ip6_negative_advice() for an exception dst that has expired and if no other socket holds a reference to the exception dst, the refcount of the exception dst is 2, which corresponds to the increment made by dst_init() and the increment made by the TCP socket for which the connection is timing out. The refcount made by the socket is never released. The refcount of the dst is decremented in sk_dst_reset() but that decrement is counteracted by a dst_hold() intentionally placed just before the sk_dst_reset() in ip6_negative_advice(). After ip6_negative_advice() has finished, there is no other object tied to the dst. The socket lost its reference stored in sk_dst_cache and the dst is no longer in the exception table. The exception dst becomes a leaked object. As a result of this dst leak, an unbalanced refcount is reported for the loopback device of a net namespace being destroyed under kernels that do not contain e5f80fcf869a ("ipv6: give an IPv6 dev to blackhole_netdev"): unregister_netdevice: waiting for lo to become free. Usage count = 2 Fix the dst leak by removing the dst_hold() in ip6_negative_advice(). The patch that introduced the dst_hold() in ip6_negative_advice() was 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22 merely refactored the code with regards to the dst refcount, so the issue was present even before 92f1655aa2b22. The bug was introduced in 54c1a859efd9f ("ipv6: Don't drop cache route entry unless timer actually expired.") where the expired cached route is deleted and the sk_dst_cache member of the socket is set to NULL by calling dst_negative_advice(), but the refcount belonging to the socket is left unbalanced. The IPv4 version - ipv4_negative_advice() - is not affected by this bug. When the TCP connection times out, ipv4_negative_advice() merely resets the sk_dst_cache of the socket while decrementing the refcount of the exception dst.
analysis: {"Container":"[Any container image running on an affected Linux kernel version]","CVE_Reason":"[IPv6 exception-route handling in the container leaks host memory because the dst object's refcount is never released]","CVE_Consequence":"[This CVE gradually exhausts host memory; severity: high]"}
cve: ./data/2024/57xxx/CVE-2024-57977.json In the Linux kernel, the following vulnerability has been resolved: memcg: fix soft lockup in the OOM process A soft lockup issue was found in the product with about 56,000 tasks in the OOM cgroup; the kernel was traversing them when the soft lockup was triggered. watchdog: BUG: soft lockup - CPU#2 stuck for 23s!
[VM Thread:1503066] CPU: 2 PID: 1503066 Comm: VM Thread Kdump: loaded Tainted: G Hardware name: Huawei Cloud OpenStack Nova, BIOS RIP: 0010:console_unlock+0x343/0x540 RSP: 0000:ffffb751447db9a0 EFLAGS: 00000247 ORIG_RAX: ffffffffffffff13 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 00000000ffffffff RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000247 RBP: ffffffffafc71f90 R08: 0000000000000000 R09: 0000000000000040 R10: 0000000000000080 R11: 0000000000000000 R12: ffffffffafc74bd0 R13: ffffffffaf60a220 R14: 0000000000000247 R15: 0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f2fe6ad91f0 CR3: 00000004b2076003 CR4: 0000000000360ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: vprintk_emit+0x193/0x280 printk+0x52/0x6e dump_task+0x114/0x130 mem_cgroup_scan_tasks+0x76/0x100 dump_header+0x1fe/0x210 oom_kill_process+0xd1/0x100 out_of_memory+0x125/0x570 mem_cgroup_out_of_memory+0xb5/0xd0 try_charge+0x720/0x770 mem_cgroup_try_charge+0x86/0x180 mem_cgroup_try_charge_delay+0x1c/0x40 do_anonymous_page+0xb5/0x390 handle_mm_fault+0xc4/0x1f0 Because thousands of processes are in the OOM cgroup, it takes a long time to traverse all of them; as a result, this leads to a soft lockup in the OOM process. To fix this issue, call 'cond_resched' in the 'mem_cgroup_scan_tasks' function per 1000 iterations. For global OOM, call 'touch_softlockup_watchdog' per 1000 iterations to avoid this issue.
analysis: {"Container":"[Any container image on a Linux kernel with memcg enabled]","CVE_Reason":"[When huge numbers of container processes sit in an OOM cgroup, scanning them can soft-lock a CPU, consuming large amounts of CPU time]","CVE_Consequence":"[This CVE can tie up host CPUs for extended periods, in severe cases disrupting other workloads on the host; severity: high]"}
cve: ./data/2024/5xxx/CVE-2024-5652.json Docker Desktop on Windows before v4.31.0 allows a user in the docker-users group to cause a Windows denial of service through the exec-path Docker daemon config option in Windows containers mode.
analysis: {"Container":"Docker Desktop Windows containers","CVE_Reason":"Via the exec-path Docker daemon config option in Windows containers mode, an attacker can trigger a Windows denial of service, potentially exhausting host resources","CVE_Consequence":"This CVE can make the Docker service on the host unavailable while consuming large amounts of CPU or memory; severity: high"}
cve: ./data/2024/7xxx/CVE-2024-7771.json A vulnerability in the Dockerized version of mintplex-labs/anything-llm (latest, digest 1d9452da2b92) allows for a denial of service. Uploading an audio file with a very low sample rate causes the functionality responsible for transcribing it to crash the entire site instance. The issue arises from the localWhisper implementation, where resampling the audio file from 1 Hz to 16000 Hz quickly exceeds available memory, leading to the Docker instance being killed by the instance manager.
analysis: {"Container":"mintplex-labs/anything-llm","CVE_Reason":"When the container processes an audio file with a very low sample rate, resampling it from 1 Hz to 16000 Hz consumes large amounts of memory and quickly exceeds the available limit.","CVE_Consequence":"This CVE causes the container to be killed by the host's instance manager for running out of memory; severity: high, as it can cause a service outage."}
cve: ./data/2024/9xxx/CVE-2024-9407.json A vulnerability exists in the bind-propagation option of the Dockerfile RUN --mount instruction. The system does not properly validate the input passed to this option, allowing users to pass arbitrary parameters to the mount instruction. This issue can be exploited to mount sensitive directories from the host into a container during the build process and, in some cases, modify the contents of those mounted files. Even if SELinux is used, this vulnerability can bypass its protection by allowing the source directory to be relabeled to give the container access to host files.
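The root problem above is pass-through of unvalidated input into a mount option. The sketch below is a minimal userspace model of the missing check, not BuildKit's actual code; `valid_propagation` and its allowlist are invented for illustration. The idea is simply to reject anything that is not a known propagation mode instead of forwarding it to the mount machinery.

```c
/* Toy model: accept only known bind-propagation modes, reject injections. */
#include <stdio.h>
#include <string.h>

static int valid_propagation(const char *opt)
{
    static const char *const allowed[] = {
        "shared", "slave", "private", "rshared", "rslave", "rprivate",
    };
    for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++)
        if (strcmp(opt, allowed[i]) == 0)
            return 1;
    return 0;
}

int main(void)
{
    /* the second input models an attempt to smuggle extra mount parameters */
    const char *inputs[] = { "rprivate", "rbind,ro,src=/" };
    for (int i = 0; i < 2; i++)
        printf("%-16s -> %s\n", inputs[i],
               valid_propagation(inputs[i]) ? "accepted" : "rejected");
    return 0;
}
```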
analysis: {"Container":"[使用Dockerfile构建的容器镜像]","CVE_Reason":"[攻击者可以通过此漏洞将宿主机的敏感目录挂载到容器中,并可能修改这些文件,从而导致容器消耗更多宿主机的物理资源,例如硬盘空间资源]","CVE_Consequence":"[该CVE可能导致宿主机文件被非法访问和修改,严重程度为高]"} cve: ./data/2024/9xxx/CVE-2024-9676.json A vulnerability was found in Podman, Buildah, and CRI-O. A symlink traversal vulnerability in the containers/storage library can cause Podman, Buildah, and CRI-O to hang and result in a denial of service via OOM kill when running a malicious image using an automatically assigned user namespace (`--userns=auto` in Podman and Buildah). The containers/storage library will read /etc/passwd inside the container, but does not properly validate if that file is a symlink, which can be used to cause the library to read an arbitrary file on the host. analysis: {"Container":"Podman, Buildah, CRI-O","CVE_Reason":"恶意容器镜像利用符号链接遍历漏洞,可能导致宿主机内存资源被大量消耗,最终触发OOM(Out of Memory)Kill。","CVE_Consequence":"该CVE会导致拒绝服务攻击,严重程度较高。"} cve: ./data/2025/21xxx/CVE-2025-21745.json In the Linux kernel, the following vulnerability has been resolved: blk-cgroup: Fix class @block_class's subsystem refcount leakage blkcg_fill_root_iostats() iterates over @block_class's devices by class_dev_iter_(init|next)(), but does not end iterating with class_dev_iter_exit(), so causes the class's subsystem refcount leakage. Fix by ending the iterating with class_dev_iter_exit(). analysis: {"Container":"[使用受影响Linux内核的任何容器镜像]","CVE_Reason":"[容器的运行会导致宿主机内存资源泄漏,因为子系统引用计数没有正确释放]","CVE_Consequence":"[该CVE可能导致宿主机内存资源逐渐耗尽,严重程度为高]"} cve: ./data/2025/21xxx/CVE-2025-21860.json In the Linux kernel, the following vulnerability has been resolved: mm/zswap: fix inconsistency when zswap_store_page() fails Commit b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()") skips charging any zswap entries when it failed to zswap the entire folio. However, when some base pages are zswapped but it failed to zswap the entire folio, the zswap operation is rolled back. When freeing zswap entries for those pages, zswap_entry_free() uncharges the zswap entries that were not previously charged, causing zswap charging to become inconsistent. This inconsistency triggers two warnings with following steps: # On a machine with 64GiB of RAM and 36GiB of zswap $ stress-ng --bigheap 2 # wait until the OOM-killer kills stress-ng $ sudo reboot The two warnings are: in mm/memcontrol.c:163, function obj_cgroup_release(): WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1)); in mm/page_counter.c:60, function page_counter_cancel(): if (WARN_ONCE(new < 0, "page_counter underflow: %ld nr_pages=%lu\n", new, nr_pages)) zswap_stored_pages also becomes inconsistent in the same way. As suggested by Kanchana, increment zswap_stored_pages and charge zswap entries within zswap_store_page() when it succeeds. This way, zswap_entry_free() will decrement the counter and uncharge the entries when it failed to zswap the entire folio. While this could potentially be optimized by batching objcg charging and incrementing the counter, let's focus on fixing the bug this time and leave the optimization for later after some evaluation. After resolving the inconsistency, the warnings disappear. [42.hyeyoo@gmail.com: refactor zswap_store_page()] analysis: {"Container":"[任何使用Linux内核并启用zswap功能的容器]","CVE_Reason":"[容器中的应用程序如果触发大量内存交换操作,可能导致zswap功能异常,进而引发宿主机内存资源计算不一致,可能间接导致更多内存资源被占用]","CVE_Consequence":"[该CVE可能导致宿主机内存资源管理出现错误,严重时可能引发内存资源过度消耗或系统不稳定,严重程度为中高]"}