Frytea's Blog

openFuyao NPU-Operator Troubleshooting

Author: tl.s
April 13, 2026, 17:01

Faulty pod: describe output

[root@master1 ~]# kubectl -n kube-system describe pod ascend-device-plugin-ll46f 
Name:                 ascend-device-plugin-ll46f
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      ascend-device-plugin-sa
Node:                 master1/10.17.30.131
Start Time:           Mon, 30 Mar 2026 11:08:32 +0800
Labels:               app.kubernetes.io/managed-by=npu-operator
                      controller-revision-hash=7df5dcb887
                      helm.sh/chart=npu-operator-0.15.0
                      name=ascend-device-plugin-ds
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: c1f2adcaeaaf2bdcf0a6e09730f68231a293074e31d58f61997f714dfb520878
                      cni.projectcalico.org/podIP: 192.168.137.118/32
                      cni.projectcalico.org/podIPs: 192.168.137.118/32
                      scheduler.alpha.kubernetes.io/critical-pod: 
                      seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:               Running
IP:                   192.168.137.118
IPs:
  IP:           192.168.137.118
Controlled By:  DaemonSet/ascend-device-plugin
Init Containers:
  init-permission:
    Container ID:  containerd://4406968a522bea48dfefebae81ec53644312762af4781c25de689952ed6c2d27
    Image:         cr.openfuyao.cn/openfuyao/busybox:1.36.1
    Image ID:      cr.openfuyao.cn/openfuyao/busybox@sha256:4b8407fadd8100c61b097d63efe992b2c033e7d371c9117f7a9462fe87e31176
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      chown 9000:9000 /var/log/mindx-dl /var/log/mindx-dl/devicePlugin
      chmod 750 /var/log/mindx-dl/devicePlugin
      
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 30 Mar 2026 15:28:32 +0800
      Finished:     Mon, 30 Mar 2026 15:28:32 +0800
    Ready:          True
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /var/log/mindx-dl/devicePlugin from log-path (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gfldg (ro)
Containers:
  device-plugin-01:
    Container ID:  containerd://fcc0c4742285847e2621a9a9217502307fc7e28644fbf86b32f9c11d67a2c0ab
    Image:         cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin:v6.0.0
    Image ID:      cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin@sha256:a5b9612b21bcd35384f9f19a05b2d7915b865e7b2be6a30bfd7806a9b8a86f58
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      --
    Args:
      device-plugin  -useAscendDocker=true -volcanoType=false -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log -logLevel=0
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 31 Mar 2026 10:28:58 +0800
      Finished:     Tue, 31 Mar 2026 10:28:58 +0800
    Ready:          False
    Restart Count:  274
    Limits:
      cpu:     500m
      memory:  500Mi
    Requests:
      cpu:     500m
      memory:  500Mi
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /tmp from tmp (rw)
      /usr/local/Ascend/driver from hiai-driver (ro)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/lib/kubelet/pod-resources from pod-resource (rw)
      /var/log/mindx-dl/devicePlugin from log-path (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gfldg (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  pod-resource:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/pod-resources
    HostPathType:  
  hiai-driver:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/Ascend/driver
    HostPathType:  
  log-path:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/mindx-dl/devicePlugin
    HostPathType:  DirectoryOrCreate
  tmp:
    Type:          HostPath (bare host directory volume)
    Path:          /tmp
    HostPathType:  
  kube-api-access-gfldg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              openfuyao.com/npu.present=
Tolerations:                 CriticalAddonsOnly op=Exists
                             device-plugin=v2:NoSchedule
                             huawei.com/Ascend910:NoSchedule op=Exists
                             node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulled   16m (x205 over 18h)     kubelet  (combined from similar events): Successfully pulled image "cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin:v6.0.0" in 403ms (403ms including waiting). Image size: 48017174 bytes.
  Warning  BackOff  2m47s (x5216 over 18h)  kubelet  Back-off restarting failed container device-plugin-01 in pod ascend-device-plugin-ll46f_kube-system(8edcd384-ab2d-4998-8077-5ac58801c79e)
  Normal   Pulling  66s (x227 over 19h)     kubelet  Pulling image "cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin:v6.0.0"

Faulty pod: /dev check

[root@master1 fuyao-26.3-rc3]# kubectl  -n kube-system exec -it daemonsets/ascend-device-plugin -- ls /dev
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
autofs           null               tty10  tty34  tty58    vcs5
bsg              ppp                tty11  tty35  tty59    vcs6
btrfs-control    ptmx               tty12  tty36  tty6     vcsa
bus              pts                tty13  tty37  tty60    vcsa1
core             random             tty14  tty38  tty61    vcsa2
cpu_dma_latency  raw                tty15  tty39  tty62    vcsa3
cuse             relationship_ctrl  tty16  tty4   tty63    vcsa4
davinci0         rfkill             tty17  tty40  tty7     vcsa5
davinci_manager  rtc0               tty18  tty41  tty8     vcsa6
devmm_svm        sda                tty19  tty42  tty9     vcsu
dri              sda1               tty2   tty43  ttyAMA0  vcsu1
fb0              sda2               tty20  tty44  ttyS0    vcsu2
fd               sg0                tty21  tty45  ttyS1    vcsu3
full             sg1                tty22  tty46  ttyS2    vcsu4
fuse             sg2                tty23  tty47  ttyS3    vcsu5
hidraw0          shm                tty24  tty48  uhid     vcsu6
hidraw1          snapshot           tty25  tty49  uinput   vfio
hisi_hdc         sr0                tty26  tty5   urandom  vga_arbiter
hwrng            sr1                tty27  tty50  usbmon0  vhost-net
input            stderr             tty28  tty51  usbmon1  vhost-vsock
kmsg             stdin              tty29  tty52  usbmon2  vport2p1
loop-control     stdout             tty3   tty53  vcs      zero
mapper           termination-log    tty30  tty54  vcs1
mem              tty                tty31  tty55  vcs2
mqueue           tty0               tty32  tty56  vcs3
net              tty1               tty33  tty57  vcs4

Faulty pod: driver check

[root@master1 fuyao-26.3-rc3]# kubectl  -n kube-system exec -it daemonsets/ascend-device-plugin -- ls -lha /usr/local/Ascend/driver
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
total 44K
drwxr-xr-x  8 root root 4.0K Mar 27 08:03 .
drwxr-xr-x  3 root root 4.0K Mar 31 02:34 ..
drwxr-xr-x  2 root root 4.0K Mar 27 08:01 bin
-r--r--r--  1 root root   20 Mar 27 08:01 build.info
dr-xr-x---  2 root root 4.0K Mar 27 08:01 device
dr-x------ 41 root root 4.0K Mar 27 08:01 kernel
drwxr-xr-x  6 root root 4.0K Mar 27 08:01 lib64
-r--r-----  1 root root   56 Mar 27 08:01 scene.info
dr-xr-x---  2 root root 4.0K Mar 27 08:01 script
drwxr-xr-x  2 root root 4.0K Mar 27 08:01 tools
-r--r--r--  1 root root  352 Mar 27 08:03 version.info

Faulty pod: logs

[root@master1 ~]# kubectl -n kube-system logs daemonsets/ascend-device-plugin --previous
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
[INFO]     2026/03/31 06:46:54.593254 1       hwlog/api.go:108    devicePlugin.log's logger init success
[INFO]     2026/03/31 06:46:54.593449 1       main.go:187    ascend device plugin starting and the version is v6.0.0_linux-aarch64
[INFO]     2026/03/31 06:46:54.593494 1       main.go:188    ascend device plugin starting scene is center
[INFO]     2026/03/31 06:46:54.787930 1       devmanager/devmanager.go:104    the dcmi version is 24.1.rc3
[ERROR]    2026/03/31 06:46:54.788019 1       devmanager/devmanager.go:211    get error card quantity: 0
[ERROR]    2026/03/31 06:46:54.788052 1       devmanager/devmanager.go:195    get card list failed for init
[ERROR]    2026/03/31 06:46:54.788101 1       main.go:203    init devmanager failed, err: auto init failed, err: get card list failed for init

Faulty pod: libdcmi and host daemon check

[root@master1 ~]# kubectl -n kube-system exec -it daemonsets/ascend-device-plugin -- bash -c 'find /usr/local/Ascend/driver -name libdcmi.so 2>/dev/null; echo $LD_LIBRARY_PATH'
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
/usr/local/Ascend/driver/lib64/driver/libdcmi.so
command terminated with exit code 137
[root@master1 ~]# ps -ef | grep -E 'dmp_daemon|slogd' | grep -v grep
root       21578       1  0 Mar30 ?        00:00:19 /usr/sbin/rsyslogd -n -i/var/run/rsyslogd.pid

Host service status check

[root@master1 ~]# systemctl status ascend-dmi
Unit ascend-dmi.service could not be found.
[root@master1 ~]# systemctl status ascend-dkms
Unit ascend-dkms.service could not be found.
[root@master1 ~]# systemctl status npu-smi
Unit npu-smi.service could not be found.
[root@master1 ~]# find / -name dmp_daemon 2>/dev/null
[root@master1 ~]# find / -name slogd 2>/dev/null
[root@master1 ~]# ls -l /var/dmp_daemon /var/slogd 2>/dev/null
[root@master1 ~]# 

This points to a DCMI problem, so the next step is to probe card enumeration directly.

Card enumeration check

#include <stdlib.h>
#include <stdio.h>
#include "dcmi_interface_api.h"

int my_get_card_list();

int main(int argc, char *argv[])
{
    my_get_card_list();
    return 0;
}

int my_get_card_list()
{
    printf("\n==================================card id info list=========================\n");
    dcmi_init();
    int card_num = 0;
    int card_list[16] = {0};
    int ret = dcmi_get_card_list(&card_num, card_list, 16);
    if (ret != DCMI_OK) {
        printf("dcmi get card list failed ret=%d\n", ret);
    }
    printf("card_num=%d, card_list:[", card_num);
    for (int i = 0; i < card_num; i++) {
        printf("%d ", card_list[i]);
    }
    return ret;
}

Compile on the host:

cc ./test1.c -o test1 -I /usr/local/dcmi -L /usr/local/dcmi -ldcmi

-I: header (.h) search path
-L: library (.so/.a) search path
-l: name of the library to link (without the lib prefix)

Compile inside a container:
nerdctl run --rm \
  -v /usr/local/Ascend:/usr/local/Ascend \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v $(pwd):/build \
  ubuntu:18.04 bash -c "
    sed -i -e 's@http*://ports.ubuntu.com/\? @http://10.17.31.217:8081/repository/mirror-ubuntu-ports/@g' \
           -e 's@http*://ports.ubuntu.com@http://10.17.31.217:8081/repository/mirror-ubuntu-ports@g' \
           /etc/apt/sources.list
    apt update && apt install -y gcc
    cd /build
    cc ./test1.c -o test1 \
      -I /usr/local/dcmi \
      -L /usr/local/dcmi \
      -L /usr/local/Ascend/driver/lib64/common \
      -L /usr/local/Ascend/driver/lib64/driver \
      -ldcmi \
      -Wl,-rpath,/usr/local/Ascend/driver/lib64/common \
      -Wl,-rpath,/usr/local/Ascend/driver/lib64/driver \
      -Wl,-rpath,/usr/local/dcmi
  "

Analyzing the binary:

[root@master1 ascend_debug]# ldd ./test1 | grep -i dcmi
        libdcmi.so => /usr/local/Ascend/driver/lib64/driver/libdcmi.so (0x0000ffffa6dd0000)
[root@master1 ascend_debug]# LD_DEBUG=libs ./test1 2>&1 | grep -i dcmi
    284830:     find library=libdcmi.so [0]; searching
    284830:      search path=/usr/local/Ascend/driver/lib64/common/tls/aarch64/atomics:/usr/local/Ascend/driver/lib64/common/tls/aarch64:/usr/local/Ascend/driver/lib64/common/tls/atomics:/usr/local/Ascend/driver/lib64/common/tls:/usr/local/Ascend/driver/lib64/common/aarch64/atomics:/usr/local/Ascend/driver/lib64/common/aarch64:/usr/local/Ascend/driver/lib64/common/atomics:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver/tls/aarch64/atomics:/usr/local/Ascend/driver/lib64/driver/tls/aarch64:/usr/local/Ascend/driver/lib64/driver/tls/atomics:/usr/local/Ascend/driver/lib64/driver/tls:/usr/local/Ascend/driver/lib64/driver/aarch64/atomics:/usr/local/Ascend/driver/lib64/driver/aarch64:/usr/local/Ascend/driver/lib64/driver/atomics:/usr/local/Ascend/driver/lib64/driver:/usr/local/dcmi/tls/aarch64/atomics:/usr/local/dcmi/tls/aarch64:/usr/local/dcmi/tls/atomics:/usr/local/dcmi/tls:/usr/local/dcmi/aarch64/atomics:/usr/local/dcmi/aarch64:/usr/local/dcmi/atomics:/usr/local/dcmi            (RUNPATH from file ./test1)
    284830:       trying file=/usr/local/Ascend/driver/lib64/common/tls/aarch64/atomics/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/common/tls/aarch64/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/common/tls/atomics/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/common/tls/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/common/aarch64/atomics/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/common/aarch64/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/common/atomics/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/common/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/driver/tls/aarch64/atomics/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/driver/tls/aarch64/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/driver/tls/atomics/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/driver/tls/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/driver/aarch64/atomics/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/driver/aarch64/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/driver/atomics/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/driver/libdcmi.so
    284830:      search path=/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/dcmi/tls/aarch64/atomics:/usr/local/dcmi/tls/aarch64:/usr/local/dcmi/tls/atomics:/usr/local/dcmi/tls:/usr/local/dcmi/aarch64/atomics:/usr/local/dcmi/aarch64:/usr/local/dcmi/atomics:/usr/local/dcmi          (RUNPATH from file ./test1)
    284830:       trying file=/usr/local/dcmi/tls/aarch64/atomics/libc.so.6
    284830:       trying file=/usr/local/dcmi/tls/aarch64/libc.so.6
    284830:       trying file=/usr/local/dcmi/tls/atomics/libc.so.6
    284830:       trying file=/usr/local/dcmi/tls/libc.so.6
    284830:       trying file=/usr/local/dcmi/aarch64/atomics/libc.so.6
    284830:       trying file=/usr/local/dcmi/aarch64/libc.so.6
    284830:       trying file=/usr/local/dcmi/atomics/libc.so.6
    284830:       trying file=/usr/local/dcmi/libc.so.6
    284830:     calling init: /usr/local/Ascend/driver/lib64/driver/libdcmi.so
    284830:     calling fini: /usr/local/Ascend/driver/lib64/driver/libdcmi.so [0]
[root@master1 ascend_debug]# readlink -f /usr/local/dcmi/libdcmi.so
/usr/local/dcmi/libdcmi.so
[root@master1 ascend_debug]# readlink -f /usr/local/Ascend/driver/lib64/driver/libdcmi.so
/usr/local/Ascend/driver/lib64/driver/libdcmi.so
[root@master1 ascend_debug]# sha256sum /usr/local/dcmi/libdcmi.so /usr/local/Ascend/driver/lib64/driver/libdcmi.so
13a38cae84bad0f06367ff9280016e372c0608ca16465b5ae5f000d3844ee401  /usr/local/dcmi/libdcmi.so
13a38cae84bad0f06367ff9280016e372c0608ca16465b5ae5f000d3844ee401  /usr/local/Ascend/driver/lib64/driver/libdcmi.so

Tracing with strace

Run on the host:

strace -f -o /tmp/host.strace -e trace=file,ioctl ./test1

Run in the container:

strace -f -o /tmp/container.strace -e trace=file,ioctl ./test1
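A quick way to compare the two traces is to diff the set of paths each run touched. This is a sketch (the helper name is ours; the log paths match the strace commands above):

```shell
# Extract the unique quoted paths from an strace log so the host and
# container runs can be diffed (helper name is ours).
files_touched() {
    grep -oE '"/[^"]*"' "$1" | tr -d '"' | sort -u
}

# Paths only the host run touched often point at the device node or
# config file missing inside the container, e.g.:
#   comm -23 <(files_touched /tmp/host.strace) <(files_touched /tmp/container.strace)
```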

Add extra mounts and continue tracing

      volumeMounts:
        - name: hdc-basic
          mountPath: /etc/hdcBasic.cfg
          readOnly: true
        - name: localtime
          mountPath: /etc/localtime
          readOnly: true
      volumes:
        - name: hdc-basic
          hostPath:
            path: /etc/hdcBasic.cfg
            type: File
        - name: localtime
          hostPath:
            path: /etc/localtime
            type: File

Trace and inspect the logs

kubectl -n kube-system exec -it ascend-device-plugin-69q5t -c device-plugin-01 -- bash

root@ascend-device-plugin-69q5t:/tmp# strace -f -o /tmp/container.strace -e trace=file,ioctl ./test1

==================================card id info list=========================
card_num=0, card_list:[


root@ascend-device-plugin-69q5t:/tmp# cat /var/log/nputools_LOG_INFO.log > /tmp/nputools_LOG_INFO.log
root@ascend-device-plugin-69q5t:/tmp# cat /var/log/nputools_LOG_ERR.log > /tmp/nputools_LOG_ERR.log
cat: /var/log/nputools_LOG_ERR.log: No such file or directory
root@ascend-device-plugin-69q5t:/tmp# cat /tmp/nputools_LOG_INFO.log 
[2026/04/01 11:18:12][0583][root][127.0.0.1][dcmi_api.c,dcmi_board_init,86]:dcmi board init success. device_count=1.
[2026/04/01 11:18:12][0583][root][127.0.0.1][dcmi_api.c,dcmi_init,119]:dcmi init all success.

Test program 2

#include <stdio.h>
#include <stdlib.h>
#include "dcmi_interface_api.h"

#ifndef DCMI_OK
#define DCMI_OK 0
#endif

/* Not declared in the header, so declare it manually */
extern int dcmi_get_card_num_list(int *card_num, int *card_list, int list_length);

static void print_list(const char *name, int ret, int num, int *list) {
    printf("%s ret=%d num=%d list=[", name, ret, num);
    for (int i = 0; i < num; ++i) {
        printf("%d ", list[i]);
    }
    printf("]\n");
}

int main(void) {
    int ret = dcmi_init();
    printf("dcmi_init ret=%d\n", ret);
    if (ret != DCMI_OK) {
        return 1;
    }

    int card_num = 0;
    int card_list[16] = {0};

    ret = dcmi_get_card_list(&card_num, card_list, 16);
    print_list("dcmi_get_card_list", ret, card_num, card_list);

    int card_num2 = 0;
    int card_list2[16] = {0};
    ret = dcmi_get_card_num_list(&card_num2, card_list2, 16);
    print_list("dcmi_get_card_num_list", ret, card_num2, card_list2);

    for (int i = 0; i < card_num && i < 16; ++i) {
        int dev_num = -1;
        ret = dcmi_get_device_num_in_card(card_list[i], &dev_num);
        printf("dcmi_get_device_num_in_card card=%d ret=%d dev_num=%d\n",
               card_list[i], ret, dev_num);
    }

    return 0;
}

Compile on the host

cc ./test2.c -o test2 -I /usr/local/dcmi -L /usr/local/dcmi -ldcmi

Compile in the container

nerdctl run --rm \
  -v /usr/local/Ascend:/usr/local/Ascend \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v $(pwd):/build \
  ubuntu:18.04 bash -c "
    sed -i -e 's@http*://ports.ubuntu.com/\? @http://10.17.31.217:8081/repository/mirror-ubuntu-ports/@g' \
           -e 's@http*://ports.ubuntu.com@http://10.17.31.217:8081/repository/mirror-ubuntu-ports@g' \
           /etc/apt/sources.list
    apt update && apt install -y gcc
    cd /build
    cc ./test2.c -o test2 \
      -I /usr/local/dcmi \
      -L /usr/local/dcmi \
      -L /usr/local/Ascend/driver/lib64/common \
      -L /usr/local/Ascend/driver/lib64/driver \
      -ldcmi \
      -Wl,-rpath,/usr/local/Ascend/driver/lib64/common \
      -Wl,-rpath,/usr/local/Ascend/driver/lib64/driver \
      -Wl,-rpath,/usr/local/dcmi
  "

Copy into the container and run

kubectl -n kube-system cp ./test2 ascend-device-plugin-69q5t:/tmp/
# run on the host
[root@master1 ascend_debug]# ./test2
dcmi_init ret=0
dcmi_get_card_list ret=0 num=1 list=[176 ]
dcmi_get_card_num_list ret=0 num=1 list=[176 ]
dcmi_get_device_num_in_card card=176 ret=0 dev_num=1

# run in the container
root@ascend-device-plugin-69q5t:/tmp# ./test2 
dcmi_init ret=0
dcmi_get_card_list ret=0 num=0 list=[]
dcmi_get_card_num_list ret=0 num=0 list=[]

Virtual machine scenario

After careful debugging by Mr. Xu, the root cause turned out to be that this is not a bare-metal environment: the virtual machine scenario requires a customized image.

According to the official documentation:

When deploying Ascend Device Plugin in a virtual machine scenario, systemd must be installed in the Ascend Device Plugin image; the recommended approach is to add RUN apt-get update && apt-get install -y systemd to the Dockerfile.

To build the image with nerdctl, first install BuildKit:

wget https://github.com/moby/buildkit/releases/download/v0.29.0/buildkit-v0.29.0.linux-arm64.tar.gz
tar zxvf buildkit-v0.29.0.linux-arm64.tar.gz
cp bin/* /usr/local/bin/

Then start buildkitd in a separate terminal. This is only needed so that nerdctl can build images; skip it otherwise.

buildkitd --oci-worker=false --containerd-worker=true --containerd-worker-namespace=k8s.io

The Dockerfile is as follows (adjust the mirror-source part as needed):

FROM hub.oepkgs.net/openfuyao/ascendhub/ascend-k8sdeviceplugin:v6.0.0

# Replace the apt mirror sources
RUN sed -i \
    -e 's@http*://ports.ubuntu.com/\? @http://10.17.31.217:8081/repository/mirror-ubuntu-ports/@g' \
    -e 's@http*://ports.ubuntu.com@http://10.17.31.217:8081/repository/mirror-ubuntu-ports@g' \
    /etc/apt/sources.list

# Install systemd
RUN apt-get update && \
    apt-get install -y --no-install-recommends systemd systemd-sysv && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Make systemd the init
STOPSIGNAL SIGRTMIN+3

STOPSIGNAL SIGRTMIN+3 tells the container runtime (containerd/docker) which signal to send when stopping the container. If the container does not run systemd as its main process (e.g. the entrypoint is the application itself), this line can be removed and has no effect. If systemd really does manage services inside the container, keeping it avoids the 30-second grace-period timeout and forced kill on kubectl delete pod.

The build command:

nerdctl build \
  --namespace k8s.io \
  -t hub.oepkgs.net/openfuyao/ascendhub/ascend-k8sdeviceplugin:v6.0.0-systemd \
  -f Dockerfile \
  .

Then replace the failing image with the newly built one.

npu-operator has the same problem; the same fix applies.
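One way to swap in the rebuilt image (a sketch; the DaemonSet and container names are taken from the describe output above, the tag from the build command):

```shell
# Point the DaemonSet at the rebuilt image; the DaemonSet then rolls its
# pods automatically.
kubectl -n kube-system set image daemonset/ascend-device-plugin \
  device-plugin-01=hub.oepkgs.net/openfuyao/ascendhub/ascend-k8sdeviceplugin:v6.0.0-systemd
```

Note that the DaemonSet is labeled app.kubernetes.io/managed-by=npu-operator, so the operator may reconcile this change away; the image reference in the operator/chart configuration may need the same update to make it stick.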

Fix verification

The fix is confirmed once the NPU resource shows up on the node.

[root@master1 ~]# kubectl  describe node master1 
Name:               master1
Roles:              control-plane,master,node,worker
Labels:             accelerator=huawei-Ascend310P
                    beta.kubernetes.io/arch=arm64
                    beta.kubernetes.io/os=linux
...
                    servertype=Ascend310P-8
                    workerselector=dls-worker-node
Annotations:        baseDeviceInfos: {"Ascend310P-0":{"IP":"","SuperDeviceID":0}}
...
Capacity:
  cpu:                    16
  ephemeral-storage:      129724184Ki
  huawei.com/Ascend310P:  1
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  hugepages-32Mi:         0
  hugepages-64Ki:         0
  memory:                 32595632Ki
  pods:                   110
Allocatable:
  cpu:                    16
  ephemeral-storage:      119553807777
  huawei.com/Ascend310P:  1
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  hugepages-32Mi:         0
  hugepages-64Ki:         0
  memory:                 32493232Ki
  pods:                   110
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource               Requests       Limits
  --------               --------       ------
  cpu                    15565m (97%)   16910m (105%)
  memory                 17492Mi (55%)  30900Mi (97%)
  ephemeral-storage      0 (0%)         0 (0%)
  hugepages-1Gi          0 (0%)         0 (0%)
  hugepages-2Mi          0 (0%)         0 (0%)
  hugepages-32Mi         0 (0%)         0 (0%)
  hugepages-64Ki         0 (0%)         0 (0%)
  huawei.com/Ascend310P  0              0
...

openFuyao 2603 Joint Testing Report

Author: tl.s
April 13, 2026, 17:01

Related links

Test environment

  • CPU: Kunpeng-920
  • OS: openEuler 24.03 LTS SP3 aarch64
  • Fuyao Version: v26.03 rc3
  • docker: 2:18.09.0-346.oe2403sp3

Tested features

  • [ ] Online deployment
  • [ ] Offline package preparation
  • [ ] Offline deployment
  • [ ] Pre-installation environment check tool
  • [ ] NPU Operator
  • [ ] AI inference suite
Suggested improvements

  • The environment check tool should verify that the default iptables policy allows traffic; the default firewall policy of FORWARD DROP can make a successfully deployed cluster unreachable, a potential problem for cluster operation and access.
  • The CLI should check for required commands up front and fail with a clear error. In particular, verify that tar / unzip are installed: they are used in many places during installation, and when one is missing there is no obvious extraction error, making the problem hard to trace.
  • The install command has changed; consider forward/backward compatibility.
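The second point could be implemented as a fail-fast guard at the top of the install script; a minimal sketch (the function name and demo command list are ours, not from the installer):

```shell
# Fail fast when a required command is missing, instead of failing later
# with an obscure extraction error.
require_cmds() {
    missing=""
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
    done
    if [ -n "$missing" ]; then
        echo "missing required commands:$missing" >&2
        return 1
    fi
}

# a real installer would check e.g.: require_cmds tar unzip || exit 1
require_cmds sh sed && echo "prerequisites OK"
```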

Scenario notes

Offline deployment of the management-plane and workload clusters

  • CPU: Kunpeng-920
  • OS: openEuler 24.03 LTS SP3 aarch64
  • Fuyao Version: v26.03 rc3
  • docker: 2:18.09.0-346.oe2403sp3

Why does building the offline artifact package on an arm64 host execute the amd64 binary?

    [bke][2026-03-26 07:48:49][INFO] The bke binary file version is . sh: line 1: /root/fuyao-26-03/packages/usr/bin/bkeadm_linux_amd64: cannot execute binary file: Exec format error
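"Exec format error" almost always means an architecture mismatch; a quick check (sketch; the binary path is the one from the log above):

```shell
# The kernel cannot run a binary whose ELF machine type does not match
# the host. Compare the two:
uname -m                                   # host architecture, e.g. aarch64
file /root/fuyao-26-03/packages/usr/bin/bkeadm_linux_amd64 || true
# an amd64 binary would be reported as "x86-64", confirming the build
# step picked the wrong-architecture bkeadm on this arm64 host
```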

Full log

[root@master1 fuyao-26-03]# cat build-offline-package.log | grep -v sha256 
2026-03-26T06:49:56.467+0800    info    infrastructure/infrastructure.go:53     The docker client is ready.
[bke][2026-03-26 06:49:56][step.1] Configuration file check
[bke][2026-03-26 06:49:56][step.2] Creates a workspace in the current directory
[bke][2026-03-26 06:49:56][step.5] Collect the required image files
[bke][2026-03-26 06:49:56][INFO] Try pulling away the mirror image cr.openfuyao.cn/openfuyao/registry:2.8.1
[bke][2026-03-26 06:49:56][step.3] Collect host dependency packages and package files
[bke][2026-03-26 06:49:56][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/kubernetes/kubernetes/releases/download/1.34.3-of.1/bin/linux/arm64/kubectl to /root/fuyao-26-03/packages/tmp/packages/files/kubectl-v1.34.3-of.1-arm64
[bke][2026-03-26 06:50:07][INFO] Try pulling away the mirror image cr.openfuyao.cn/openfuyao/registry:2.8.1
[bke][2026-03-26 06:50:18][step.6] Collect images from the source repository to the target repository
[bke][2026-03-26 06:50:18][INFO] Remove the image repository
2026-03-26T06:50:18.026+0800    info    infrastructure/infrastructure.go:53     The docker client is ready.
2026-03-26T06:50:18.028+0800    info    infrastructure/infrastructure.go:53     The docker client is ready.
[bke][2026-03-26 06:50:20][WARN] Image cr.openfuyao.cn/openfuyao/registry:2.8.1 inspect failed: Error response from daemon: no such image: cr.openfuyao.cn/openfuyao/registry:2.8.1: No such image: cr.openfuyao.cn/openfuyao/registry:2.8.1, retrying (1/3)...
[bke][2026-03-26 06:50:21][WARN] Image cr.openfuyao.cn/openfuyao/registry:2.8.1 inspect failed: Error response from daemon: no such image: cr.openfuyao.cn/openfuyao/registry:2.8.1: No such image: cr.openfuyao.cn/openfuyao/registry:2.8.1, retrying (2/3)...
[bke][2026-03-26 06:50:22][WARN] Image cr.openfuyao.cn/openfuyao/registry:2.8.1 inspect failed: Error response from daemon: no such image: cr.openfuyao.cn/openfuyao/registry:2.8.1: No such image: cr.openfuyao.cn/openfuyao/registry:2.8.1, retrying (3/3)...
[bke][2026-03-26 06:50:23][WARN] Get image cr.openfuyao.cn/openfuyao/registry:2.8.1 inspect failed: failed to inspect image cr.openfuyao.cn/openfuyao/registry:2.8.1 after 3 attempts: Error response from daemon: no such image: cr.openfuyao.cn/openfuyao/registry:2.8.1: No such image: cr.openfuyao.cn/openfuyao/registry:2.8.1
[bke][2026-03-26 06:50:23][INFO] Image cr.openfuyao.cn/openfuyao/registry:2.8.1 is downloading
[bke][2026-03-26 06:50:37][INFO] Wait for the container mirroring service to start...
[bke][2026-03-26 06:50:42][INFO] The container mirroring service is started. 
Getting image list signatures
Copying 6 images generated from 6 images in list
Getting image source signatures
[bke][2026-03-26 06:50:56][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/kubernetes/kubernetes/releases/download/1.34.3-of.1/bin/linux/arm64/kubelet to /root/fuyao-26-03/packages/tmp/packages/files/kubelet-v1.34.3-of.1-arm64
Writing manifest to image destination
Getting image source signatures
Writing manifest to image destination
Getting image source signatures
Writing manifest to image destination
Getting image source signatures
Writing manifest to image destination
Getting image source signatures
Writing manifest to image destination
Getting image source signatures
Writing manifest to image destination
Writing manifest list to image destination
Storing list signatures
Getting image list signatures
Copying 7 images generated from 7 images in list
Getting image source signatures
Writing manifest to image destination
Getting image source signatures
Writing manifest to image destination
Getting image source signatures
[bke][2026-03-26 06:52:10][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/kubernetes/kubernetes/releases/download/1.34.3-of.1/bin/linux/amd64/kubectl to /root/fuyao-26-03/packages/tmp/packages/files/kubectl-v1.34.3-of.1-amd64
Writing manifest to image destination
Getting image source signatures
Writing manifest to image destination
Getting image source signatures
Writing manifest to image destination
Getting image source signatures
Writing manifest to image destination
Getting image source signatures
Writing manifest to image destination
Writing manifest list to image destination
Storing list signatures
Getting image list signatures
Copying 5 images generated from 5 images in list
Getting image source signatures
Writing manifest to image destination
Getting image source signatures
[bke][2026-03-26 06:53:23][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/kubernetes/kubernetes/releases/download/1.34.3-of.1/bin/linux/amd64/kubelet to /root/fuyao-26-03/packages/tmp/packages/files/kubelet-v1.34.3-of.1-amd64
Writing manifest to image destination
Getting image source signatures
Writing manifest to image destination
Getting image source signatures
Writing manifest to image destination
Getting image source signatures
[bke][2026-03-26 06:54:34][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/containerd/containerd/releases/download/v2.1.1-origin/containerd-v2.1.1-linux-amd64.tar.gz to /root/fuyao-26-03/packages/tmp/packages/files/containerd-v2.1.1-linux-amd64.tar.gz
Writing manifest to image destination
Writing manifest list to image destination
Storing list signatures
Getting image list signatures
Copying 4 images generated from 4 images in list
Getting image source signatures
Writing manifest to image destination
Getting image source signatures
Writing manifest to image destination
Getting image source signatures
Writing manifest to image destination
Getting image source signatures
Writing manifest to image destination
Writing manifest list to image destination
Storing list signatures
Getting image list signatures
Copying 3 images generated from 3 images in list
Getting image source signatures
[bke][2026-03-26 06:55:53][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/containerd/containerd/releases/download/v2.1.1-origin/containerd-v2.1.1-linux-arm64.tar.gz to /root/fuyao-26-03/packages/tmp/packages/files/containerd-v2.1.1-linux-arm64.tar.gz
Writing manifest to image destination
Getting image source signatures
[bke][2026-03-26 06:57:07][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/containernetworking/plugins/releases/download/v1.4.1/cni-plugins-linux-amd64-v1.4.1.tgz to /root/fuyao-26-03/packages/tmp/packages/files/cni-plugins-linux-amd64-v1.4.1.tgz
[bke][2026-03-26 06:58:00][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/containernetworking/plugins/releases/download/v1.4.1/cni-plugins-linux-arm64-v1.4.1.tgz to /root/fuyao-26-03/packages/tmp/packages/files/cni-plugins-linux-arm64-v1.4.1.tgz
[bke][2026-03-26 06:58:49][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/helm/releases/download/v3.14.2/helm-v3.14.2-linux-amd64.tar.gz to /root/fuyao-26-03/packages/tmp/packages/files/helm-v3.14.2-linux-amd64.tar.gz
[bke][2026-03-26 06:59:08][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/helm/releases/download/v3.14.2/helm-v3.14.2-linux-arm64.tar.gz to /root/fuyao-26-03/packages/tmp/packages/files/helm-v3.14.2-linux-arm64.tar.gz
[bke][2026-03-26 06:59:24][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/mikefarah/yq/releases/download/v4.43.1/yq_linux_arm64 to /root/fuyao-26-03/packages/tmp/packages/files/yq_linux_arm64
[bke][2026-03-26 06:59:35][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/mikefarah/yq/releases/download/v4.43.1/yq_linux_amd64 to /root/fuyao-26-03/packages/tmp/packages/files/yq_linux_amd64
... (repeated skopeo image-copy progress lines omitted) ...
[bke][2026-03-26 06:59:44][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/jqlang/jq/releases/download/v1.7.1/jq-linux-arm64 to /root/fuyao-26-03/packages/tmp/packages/files/jq-linux-arm64
[bke][2026-03-26 06:59:46][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/jqlang/jq/releases/download/v1.7.1/jq-linux-amd64 to /root/fuyao-26-03/packages/tmp/packages/files/jq-linux-amd64
[bke][2026-03-26 06:59:49][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/cloudflare/cfssl/releases/download/v1.6.4/cfssl-certinfo_1.6.4_linux_arm64 to /root/fuyao-26-03/packages/tmp/packages/files/cfssl-certinfo_1.6.4_linux_arm64
[bke][2026-03-26 07:00:00][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/cloudflare/cfssl/releases/download/v1.6.4/cfssl-certinfo_1.6.4_linux_amd64 to /root/fuyao-26-03/packages/tmp/packages/files/cfssl-certinfo_1.6.4_linux_amd64
[bke][2026-03-26 07:00:11][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/cloudflare/cfssl/releases/download/v1.6.4/cfssl_1.6.4_linux_arm64 to /root/fuyao-26-03/packages/tmp/packages/files/cfssl_1.6.4_linux_arm64
[bke][2026-03-26 07:00:24][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/cloudflare/cfssl/releases/download/v1.6.4/cfssl_1.6.4_linux_amd64 to /root/fuyao-26-03/packages/tmp/packages/files/cfssl_1.6.4_linux_amd64
[bke][2026-03-26 07:00:38][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/cloudflare/cfssl/releases/download/v1.6.4/cfssljson_1.6.4_linux_arm64 to /root/fuyao-26-03/packages/tmp/packages/files/cfssljson_1.6.4_linux_arm64
[bke][2026-03-26 07:00:46][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/cloudflare/cfssl/releases/download/v1.6.4/cfssljson_1.6.4_linux_amd64 to /root/fuyao-26-03/packages/tmp/packages/files/cfssljson_1.6.4_linux_amd64
[bke][2026-03-26 07:00:55][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/opencontainers/runc/releases/download/v1.1.12/runc-arm64 to /root/fuyao-26-03/packages/tmp/packages/files/runc-arm64
[bke][2026-03-26 07:01:07][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/opencontainers/runc/releases/download/v1.1.12/runc-amd64 to /root/fuyao-26-03/packages/tmp/packages/files/runc-amd64
[bke][2026-03-26 07:01:19][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/etcd-io/etcd/releases/download/v3.5.6/etcdctl-v3.5.6-linux-amd64 to /root/fuyao-26-03/packages/tmp/packages/files/etcdctl-v3.5.6-linux-amd64
[bke][2026-03-26 07:01:40][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/etcd-io/etcd/releases/download/v3.5.6/etcdctl-v3.5.6-linux-arm64 to /root/fuyao-26-03/packages/tmp/packages/files/etcdctl-v3.5.6-linux-arm64
[bke][2026-03-26 07:01:58][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/rpm/releases/download/v0.0.1/rpm.tar.gz to /root/fuyao-26-03/packages/tmp/packages/files/rpm.tar.gz
... (repeated skopeo image-copy progress lines omitted) ...
[bke][2026-03-26 07:45:30][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/nfs/nfsshare.tar.gz to /root/fuyao-26-03/packages/tmp/packages/files/nfsshare.tar.gz
[bke][2026-03-26 07:45:30][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/openFuyao/bkeadm/releases/download/1.2.1/bkeadm_linux_amd64 to /root/fuyao-26-03/packages/tmp/packages/files/bkeadm_linux_amd64
[bke][2026-03-26 07:46:48][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/openFuyao/bkeadm/releases/download/1.2.1/bkeadm_linux_arm64 to /root/fuyao-26-03/packages/tmp/packages/files/bkeadm_linux_arm64
[bke][2026-03-26 07:48:03][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/charts/releases/download/oauth-webhook-1.0.2.tgz to /root/fuyao-26-03/packages/tmp/charts/oauth-webhook-1.0.2.tgz
[bke][2026-03-26 07:48:03][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/charts/releases/download/oauth-server-1.0.3.tgz to /root/fuyao-26-03/packages/tmp/charts/oauth-server-1.0.3.tgz
[bke][2026-03-26 07:48:04][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/charts/releases/download/console-website-1.0.4.tgz to /root/fuyao-26-03/packages/tmp/charts/console-website-1.0.4.tgz
[bke][2026-03-26 07:48:04][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/charts/releases/download/monitoring-service-1.0.4.tgz to /root/fuyao-26-03/packages/tmp/charts/monitoring-service-1.0.4.tgz
[bke][2026-03-26 07:48:04][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/charts/releases/download/console-service-1.0.4.tgz to /root/fuyao-26-03/packages/tmp/charts/console-service-1.0.4.tgz
[bke][2026-03-26 07:48:04][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/charts/releases/download/marketplace-service-1.0.3.tgz to /root/fuyao-26-03/packages/tmp/charts/marketplace-service-1.0.3.tgz
[bke][2026-03-26 07:48:04][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/charts/releases/download/application-management-service-1.0.2.tgz to /root/fuyao-26-03/packages/tmp/charts/application-management-service-1.0.2.tgz
[bke][2026-03-26 07:48:04][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/charts/releases/download/plugin-management-service-1.0.2.tgz to /root/fuyao-26-03/packages/tmp/charts/plugin-management-service-1.0.2.tgz
[bke][2026-03-26 07:48:04][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/charts/releases/download/user-management-operator-1.0.2.tgz to /root/fuyao-26-03/packages/tmp/charts/user-management-operator-1.0.2.tgz
[bke][2026-03-26 07:48:04][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/charts/releases/download/web-terminal-service-1.0.3.tgz to /root/fuyao-26-03/packages/tmp/charts/web-terminal-service-1.0.3.tgz
[bke][2026-03-26 07:48:04][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/charts/releases/download/installer-service-1.2.1.tgz to /root/fuyao-26-03/packages/tmp/charts/installer-service-1.2.1.tgz
[bke][2026-03-26 07:48:04][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/charts/releases/download/installer-website-1.2.1.tgz to /root/fuyao-26-03/packages/tmp/charts/installer-website-1.2.1.tgz
[bke][2026-03-26 07:48:04][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/charts/releases/download/harbor-1.11.4.tgz to /root/fuyao-26-03/packages/tmp/charts/harbor-1.11.4.tgz
[bke][2026-03-26 07:48:04][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/charts/releases/download/bke-console-website-1.0.3.tgz to /root/fuyao-26-03/packages/tmp/charts/bke-console-website-1.0.3.tgz
[bke][2026-03-26 07:48:05][INFO] Collecting file packages https://openfuyao.obs.cn-north-4.myhuaweicloud.com/charts/releases/download/bke-console-service-1.0.2.tgz to /root/fuyao-26-03/packages/tmp/charts/bke-console-service-1.0.2.tgz
[bke][2026-03-26 07:48:49][step.4] Collect the bke binary file
[bke][2026-03-26 07:48:49][INFO] The bke binary file version is . sh: line 1: /root/fuyao-26-03/packages/usr/bin/bkeadm_linux_amd64: cannot execute binary file: Exec format error
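The `Exec format error` at 07:48:49 above is worth pausing on: it almost always means the ELF binary's target architecture does not match the build host's CPU (here, the script tried to execute `bkeadm_linux_amd64`, which cannot run on an aarch64 host). A minimal check, using the binary path from the log above:

```shell
# "Exec format error" = ELF architecture mismatch with the host CPU.
# First, see what the host actually is:
arch=$(uname -m)
echo "host arch: $arch"   # e.g. x86_64 or aarch64

# Then, on the build host, inspect the binary the script tried to run
# (path taken from the log above):
#   file /root/fuyao-26-03/packages/usr/bin/bkeadm_linux_amd64
# If it reports "ELF 64-bit LSB executable, x86-64" while uname -m says
# aarch64, use the bkeadm_linux_arm64 binary on that host instead.
```

In this build log the error is non-fatal (packaging continues and completes at step.8), but it explains why the reported "bke binary file version" is empty.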
... (repeated skopeo image-copy progress lines omitted) ...
[bke][2026-03-26 08:41:56][INFO] The system starts to pack the image file.
[bke][2026-03-26 08:43:49][INFO] Remove the image repository
2026-03-26T08:43:49.940+0800    info    infrastructure/infrastructure.go:53     The docker client is ready.
[bke][2026-03-26 08:43:51][step.7] Build the bke package, please wait for the larger package...
[bke][2026-03-26 08:46:22][step.8] Packaging complete bke.tar.gz
Deploying the NPU Operator on openEuler 24.03 LTS leaves one pod unable to start:
Every 1.0s: kubectl  get pod -A                                                                                                                                                                                                                                                                                                                          master1: Fri Mar 27 10:33:04 2026

NAMESPACE                     NAME                                                 READY   STATUS             RESTARTS      AGE
cluster-system                bkeagent-deployer-fcnfd                              1/1     Running            0             29m
default                       mindio-acp-g67n9                                     1/1     Running            0             7m56s
default                       mindio-tft-bk8mj                                     1/1     Running            0             7m57s
ingress-nginx                 ingress-nginx-admission-create-fh4rv                 0/1     Completed          0             28m
ingress-nginx                 ingress-nginx-admission-patch-nn79l                  0/1     Completed          2             28m
ingress-nginx                 ingress-nginx-controller-jdb5s                       1/1     Running            0             28m
kube-system                   ascend-device-plugin-5hmhf                           0/1     CrashLoopBackOff   6 (33s ago)   7m59s
kube-system                   calico-kube-controllers-9c987b475-db86h              1/1     Running            0             29m
kube-system                   calico-node-sfvjm                                    1/1     Running            0             29m
kube-system                   coredns-6bd9b76578-2mqzh                             1/1     Running            2 (29m ago)   29m
kube-system                   coredns-6bd9b76578-94pmp                             1/1     Running            2 (21m ago)   29m
kube-system                   etcd-master1                                         1/1     Running            0             28m
kube-system                   kube-apiserver-master1                               1/1     Running            0             27m
kube-system                   kube-controller-manager-master1                      1/1     Running            1 (27m ago)   29m
kube-system                   kube-proxy-rwpzc                                     1/1     Running            0             29m
kube-system                   kube-scheduler-master1                               1/1     Running            1 (27m ago)   29m
kube-system                   metrics-server-db68b78d-b8kkn                        1/1     Running            0             26m
mindx-dl                      ascend-operator-manager-5d4d89f675-8dzn4             0/1     Pending            0             7m59s
mindx-dl                      clusterd-5588c5dc88-qmblm                            0/1     Pending            0             7m57s
mindx-dl                      noded-rc5lj                                          1/1     Running            0             7m59s
mindx-dl                      resilience-controller-8686bbd76f-qzc2z               0/1     Pending            0             7m57s
monitoring                    alertmanager-main-0                                  2/2     Running            0             28m
monitoring                    alertmanager-main-1                                  2/2     Running            0             28m
monitoring                    alertmanager-main-2                                  2/2     Running            0             28m
monitoring                    blackbox-exporter-6877f5c5f7-66qn4                   3/3     Running            0             28m
monitoring                    kube-state-metrics-6f6c47f5f4-4pt9s                  3/3     Running            0             28m
monitoring                    node-exporter-ng468                                  2/2     Running            0             28m
monitoring                    prometheus-k8s-0                                     2/2     Running            0             28m
monitoring                    prometheus-k8s-1                                     2/2     Running            0             28m
monitoring                    prometheus-operator-6698d7bc85-72xwx                 2/2     Running            0             28m
npu-exporter                  npu-exporter-vnzk6                                   0/1     Pending            0             7m57s
npu                           ascend-runtime-containerd-7s6jv                      1/1     Running            0             8m   
npu                           npu-driver-l7rjp                                     1/1     Running            0             8m   
npu                           npu-feature-discovery-xr7nt                          1/1     Running            0             19m
npu                           npu-node-feature-discovery-gc-5d97746dbc-wvnx6       1/1     Running            0             9m52s
npu                           npu-node-feature-discovery-master-664666b7bb-7qjfh   1/1     Running            0             9m43s
npu                           npu-node-feature-discovery-worker-8mj5w              1/1     Running            0             9m38s
npu                           npu-operator-5678cd59d4-ljvc8                        1/1     Running            0             19m
openfuyao-system-controller   modify-manifests-master1-rhfqf                       0/1     Completed          0             27m
openfuyao-system-controller   openfuyao-system-controller-798c4f6598-npmjs         1/1     Running            0             29m
openfuyao-system              application-management-service-77457c5c85-g7b6z      2/2     Running            0             28m
openfuyao-system              console-service-7fdb88c9c6-g2kp5                     1/1     Running            0             28m
openfuyao-system              console-website-c78945fcc-dm6s4                      1/1     Running            0             28m
openfuyao-system              local-harbor-chartmuseum-57fdd9949d-5xwh6            1/1     Running            0             28m
openfuyao-system              local-harbor-core-6f7d4cc767-9rp6l                   1/1     Running            0             28m
openfuyao-system              local-harbor-database-0                              1/1     Running            0             28m
openfuyao-system              local-harbor-jobservice-7b8c9bf798-ms8qh             1/1     Running            4 (27m ago)   28m
openfuyao-system              local-harbor-nginx-78b94f7b74-nflct                  1/1     Running            0             28m
openfuyao-system              local-harbor-portal-6b8cbf6747-gq6wd                 1/1     Running            0             28m
openfuyao-system              local-harbor-redis-0                                 1/1     Running            0             28m
openfuyao-system              local-harbor-registry-7879c9d46d-d8wq6               2/2     Running            0             28m
openfuyao-system              marketplace-service-5cf7cd6f5b-bmknq                 2/2     Running            0             28m
openfuyao-system              monitoring-service-6fd8dbd59f-bwq96                  2/2     Running            0             28m
openfuyao-system              oauth-server-64f6545c48-2hm7z                        1/1     Running            0             26m
openfuyao-system              oauth-webhook-78cb864fc5-gsmw9                       1/1     Running            0             26m
openfuyao-system              plugin-management-service-6fd8f64cc4-zvql8           2/2     Running            0             26m
openfuyao-system              user-management-operator-9bb7bf64-6q9xg              1/1     Running            0             26m
openfuyao-system              web-terminal-service-6f64b888f9-gdxpb                1/1     Running            0             26m
volcano-system                volcano-controllers-6ffb787f8d-chk69                 0/1     Pending            0             7m58s
volcano-system                volcano-scheduler-867f9784bb-tvhzh                   0/1     Pending            0             7m57s

[root@master1 fuyao-26-03]# kubectl  -n kube-system logs ascend-device-plugin-5hmhf
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
[INFO]     2026/03/27 02:26:56.030212 1       hwlog/api.go:108    devicePlugin.log's logger init success
[INFO]     2026/03/27 02:26:56.030715 1       main.go:187    ascend device plugin starting and the version is v6.0.0_linux-aarch64
[INFO]     2026/03/27 02:26:56.030759 1       main.go:188    ascend device plugin starting scene is center
2026/03/27 02:26:56 command exec failed, exit status 1
[ERROR]    2026/03/27 02:26:56.032950 1       devmanager/devmanager.go:95    deviceManager init failed, prepare dcmi failed, err: cannot found valid driver lib, fromEnv: lib path is invalid, [], fromLdCmd: can't find valid lib: EOF
[ERROR]    2026/03/27 02:26:56.033013 1       main.go:203    init devmanager failed, err: auto init failed, err: get chip info failed, err: device Manager is nil, may encounter an exception during initialization. You can check the system log to confirm
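The `cannot found valid driver lib` error means the device plugin cannot locate the DCMI library that ships with the Ascend NPU driver. A quick way to reproduce the check on the node itself (the library name `libdcmi` is taken from the error path; whether it lives in the linker cache on your host is an assumption based on standard Ascend driver installs):

```shell
# Look for the DCMI library the device plugin tries to load; if it is
# absent, the NPU driver is not installed (or not mounted into the pod).
if ldconfig -p 2>/dev/null | grep -qi 'libdcmi'; then
    echo "libdcmi present in the linker cache"
else
    echo "libdcmi not found - the NPU driver is likely missing on this node"
fi
```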

Environment check

Following https://gitcode.com/openFuyao/sig-installation/blob/master/docs/zh/user_guide/cluster_installation_deployment/environment_pre_check_tool_guide.md, download the env-check tool:
wget https://openfuyao.obs.cn-north-4.myhuaweicloud.com/openFuyao/env-check/releases/download/latest/bin/linux/arm64/envCheck
wget https://openfuyao.obs.cn-north-4.myhuaweicloud.com/openFuyao/env-check/releases/download/latest/bin/linux/arm64/config.json

File check:

[root@localhost env]# ./envCheck query
[INFO][2026-03-27 02:53:25] envCheck tool start
[INFO][2026-03-27 02:53:25] mode: fileQuery
[INFO][2026-03-27 02:53:25] file query start ......
[INFO][2026-03-27 02:53:25] query path: [/root/.kube /etc/kubernetes /usr/local/bin/crictl /etc/sysctl.d/k8s.conf /etc/systemd/system/kubelet.service /etc/systemd/system/kubelet.service.d /var/lib/etcd /var/lib/kubelet /run/containerd/containerd.sock /usr/lib/systemd/system/kubelet.service.d /var/run/containerd/containerd.sock /var/run/docker.sock]
[INFO][2026-03-27 02:53:25] path not exist: /root/.kube
[INFO][2026-03-27 02:53:25] path not exist: /etc/kubernetes
[INFO][2026-03-27 02:53:25] path not exist: /usr/local/bin/crictl
[INFO][2026-03-27 02:53:25] path not exist: /etc/sysctl.d/k8s.conf
[INFO][2026-03-27 02:53:25] path not exist: /etc/systemd/system/kubelet.service
[INFO][2026-03-27 02:53:25] path not exist: /etc/systemd/system/kubelet.service.d
[INFO][2026-03-27 02:53:25] path not exist: /var/lib/etcd
[INFO][2026-03-27 02:53:25] path not exist: /var/lib/kubelet
[INFO][2026-03-27 02:53:25] path not exist: /run/containerd/containerd.sock
[INFO][2026-03-27 02:53:25] path not exist: /usr/lib/systemd/system/kubelet.service.d
[INFO][2026-03-27 02:53:25] path not exist: /var/run/containerd/containerd.sock
[WARNING][2026-03-27 02:53:25] file exist: /var/run/docker.sock
[INFO][2026-03-27 02:53:25] file query completed

Query Time: 2026-03-27 02:53:25

+-------------------------------------------+---------+---------+-------+--------+-------------+
|                   Path                    | Exists  |  Type   | Owner | Group  | Permissions |
+-------------------------------------------+---------+---------+-------+--------+-------------+
|                /root/.kube                | Missing | Missing |-------+--------+-------------+
|              /etc/kubernetes              | Missing | Missing |-------+--------+-------------+
|           /usr/local/bin/crictl           | Missing | Missing |-------+--------+-------------+
|          /etc/sysctl.d/k8s.conf           | Missing | Missing |-------+--------+-------------+
|    /etc/systemd/system/kubelet.service    | Missing | Missing |-------+--------+-------------+
|   /etc/systemd/system/kubelet.service.d   | Missing | Missing |-------+--------+-------------+
|               /var/lib/etcd               | Missing | Missing |-------+--------+-------------+
|             /var/lib/kubelet              | Missing | Missing |-------+--------+-------------+
|      /run/containerd/containerd.sock      | Missing | Missing |-------+--------+-------------+
| /usr/lib/systemd/system/kubelet.service.d | Missing | Missing |-------+--------+-------------+
|    /var/run/containerd/containerd.sock    | Missing | Missing |-------+--------+-------------+
|           /var/run/docker.sock            | Exists  |  File   | root  | docker | Srw-rw----  |
+-------------------------------------------+---------+---------+-------+--------+-------------+

+-------------------+-------+
|      Summary      | Count |
+-------------------+-------+
|   Total Checked   |  12   |
|   Total Exists    |   1   |
|   Total Missing   |  11   |
| Total Directories |   0   |
|    Total Files    |   1   |
+-------------------+-------+


[INFO][2026-03-27 02:53:25] completed

Program existence check:

[root@localhost env]# ./envCheck check
[INFO][2026-03-27 02:54:04] envCheck tool start
[INFO][2026-03-27 02:54:04] mode: programCheck
[INFO][2026-03-27 02:54:04] program check start......
[INFO][2026-03-27 02:54:04] os: linux, arch: arm64
[INFO][2026-03-27 02:54:04] program list to check: [docker kubectl containerd]
[INFO][2026-03-27 02:54:04] check program: docker
[WARNING][2026-03-27 02:54:04] docker installed - version: Docker version 18.09.0, build d51e3ad
[INFO][2026-03-27 02:54:04] check program: kubectl
[INFO][2026-03-27 02:54:04] not install: kubectl
[INFO][2026-03-27 02:54:04] check program: containerd
[WARNING][2026-03-27 02:54:04] containerd installed - version: time="2026-03-27T02:54:04Z" level=warning msg="init error, wrong runtimeTimeout format: time: invalid duration """ 
containerd  version:1.2.0.320.oe2203sp4 871075eb7cc979944ba2d987719cb534bbb87e5c
[INFO][2026-03-27 02:54:04] program check completed
[WARNING][2026-03-27 02:54:04] detected installed application(s): docker, containerd. Please uninstall it(them) yourself

Check Time: 2026-03-27 02:54:04

+------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
|  Program   |    Status     |                                                                                               Version                                                                                                |        Path         |
+------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
|   docker   |   Installed   |                                                                                Docker version 18.09.0, build d51e3ad                                                                                 |   /usr/bin/docker   |
|  kubectl   | Not Installed |                                                                                               Unknown                                                                                                |      Not found      |
| containerd |   Installed   | time="2026-03-27T02:54:04Z" level=warning msg="init error, wrong runtimeTimeout format: time: invalid duration """ 
containerd  version:1.2.0.320.oe2203sp4 871075eb7cc979944ba2d987719cb534bbb87e5c | /usr/bin/containerd |
+------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+

+-----------------+-------+
|     Summary     | Count |
+-----------------+-------+
|  Total Checked  |   3   |
| Total Installed |   2   |
|  Total Missing  |   1   |
+-----------------+-------+


[INFO][2026-03-27 02:54:04] completed
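envCheck's program check boils down to probing `PATH` for each binary. The same probe can be reproduced with a few lines of shell when the tool is unavailable (a sketch of the idea, not the tool's actual implementation):

```shell
# Probe for the runtimes envCheck inspects; anything reported as
# installed here must be uninstalled before running the installer.
for prog in docker kubectl containerd; do
    if command -v "$prog" >/dev/null 2>&1; then
        echo "$prog: installed at $(command -v "$prog")"
    else
        echo "$prog: not installed"
    fi
done
```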

Online deployment

Only after finishing the deployment did I read the NPU Operator documentation and discover that it requires openEuler 22.03 LTS, so I reinstalled:
https://docs.openfuyao.cn/zh/docs/v25.12/user_guide/npu_operator.html#%E5%AE%89%E8%A3%85
  • CPU: Kunpeng-920
  • OS: openEuler 22.03 (LTS-SP4) aarch64
  • Fuyao Version: v26.03 rc3
  • docker: 2:18.09.0-346.oe2403sp3

Performed a fresh install from the openEuler 22.03 (LTS-SP4) aarch64 cloud image, expanded the disk, and deployed again.

[root@localhost fuyao-26.3-rc3]# ./bkeadm_linux_arm64 init --otherRepo cr.openfuyao.cn/openfuyao/bke-online-installed:latest
--hostIP:            10.17.30.131
--domain:            deploy.bocloud.k8s
--kubernetesPort:    36443
--imageRepoPort:     40443
--yumRepoPort:       40080
--chartRepoPort:     38080
--ntpServer:         cn.pool.ntp.org:123
--runtime:           containerd
--runtimeStorage:    /var/lib/containerd
--clusterAPI:        1.2.1
--oFVersion:         v26.03-rc.3
--versionUrl:        https://openfuyao.obs.cn-north-4.myhuaweicloud.com/openFuyao/version-config/
--enableNTP:         true
--agentHealthPort:   58080
--otherRepo:         cr.openfuyao.cn/openfuyao/bke-online-installed:latest
Confirm the parameters, press Y to continue N will exit. [Y/N]? y
[bke][2026-03-27 02:57:16][INFO] BKE initialize ...
[bke][2026-03-27 02:57:16][INFO] HOSTNAME: localhost
[bke][2026-03-27 02:57:16][INFO] PLATFORM: openeuler
[bke][2026-03-27 02:57:16][INFO] Version:  22.03
[bke][2026-03-27 02:57:16][INFO] KERNEL:   5.10.0-216.0.0.115.oe2203sp4.aarch64
[bke][2026-03-27 02:57:16][INFO] GOOS:     linux
[bke][2026-03-27 02:57:16][INFO] ARCH:     arm64
[bke][2026-03-27 02:57:16][INFO] CPU:      8
[bke][2026-03-27 02:57:16][INFO] MEMORY:   16G
[bke][2026-03-27 02:57:16][INFO] BKE Console: ENABLED
[bke][2026-03-27 02:57:16][INFO] BKE initialize environment check...
[bke][2026-03-27 02:57:16][WARN] Client authentication enabled but no credentials provided
2026-03-27T02:57:16.485Z        info    infrastructure/infrastructure.go:53     The docker client is ready.
[bke][2026-03-27 02:57:16][INFO] set up the host machine zone
[bke][2026-03-27 02:57:16][INFO] set ntp server
Fri Mar 27 02:57:15 AM CST 2026

[bke][2026-03-26 18:57:15][INFO] config local source
[bke][2026-03-26 18:57:15][INFO] 在线模式:domain:deploy.bocloud.k8s 绑定到默认IP:10.17.30.131
[bke][2026-03-26 18:57:49][INFO] iptables -V output: iptables v1.8.7 (legacy)
[bke][2026-03-26 18:57:49][INFO] workDir /bke mod permission success
[bke][2026-03-26 18:57:49][ERROR] Failed to start the container service, open /bke/mount/source_registry/files: no such file or directory

The latest version apparently renamed the command-line parameter (`--otherRepo` is now `--onlineImage`); with the new flag the init proceeds normally.

[root@stl-fuyao-2603 fuyao-26.3-rc3]# ./bkeadm_linux_arm64 init --onlineImage cr.openfuyao.cn/openfuyao/bke-online-installed:latest
--hostIP:            10.17.30.131
--domain:            deploy.bocloud.k8s
--kubernetesPort:    36443
--imageRepoPort:     40443
--yumRepoPort:       40080
--chartRepoPort:     38080
--ntpServer:         cn.pool.ntp.org:123
--runtime:           containerd
--runtimeStorage:    /var/lib/containerd
--clusterAPI:        1.2.1
--oFVersion:         v26.03-rc.3
--versionUrl:        https://openfuyao.obs.cn-north-4.myhuaweicloud.com/openFuyao/version-config/
--enableNTP:         true
--agentHealthPort:   58080
--onlineImage:       cr.openfuyao.cn/openfuyao/bke-online-installed:latest
Confirm the parameters, press Y to continue N will exit. [Y/N]? y
[bke][2026-03-27 11:36:51][INFO] BKE initialize ...
[bke][2026-03-27 11:36:51][INFO] HOSTNAME: stl-fuyao-2603
[bke][2026-03-27 11:36:51][INFO] PLATFORM: openeuler
[bke][2026-03-27 11:36:51][INFO] Version:  22.03
[bke][2026-03-27 11:36:51][INFO] KERNEL:   5.10.0-216.0.0.115.oe2203sp4.aarch64
[bke][2026-03-27 11:36:51][INFO] GOOS:     linux
[bke][2026-03-27 11:36:51][INFO] ARCH:     arm64
[bke][2026-03-27 11:36:51][INFO] CPU:      8
[bke][2026-03-27 11:36:51][INFO] MEMORY:   16G
[bke][2026-03-27 11:36:51][INFO] BKE Console: ENABLED
[bke][2026-03-27 11:36:51][INFO] BKE initialize environment check...
[bke][2026-03-27 11:36:51][WARN] Client authentication enabled but no credentials provided
[bke][2026-03-27 11:36:51][INFO] set up the host machine zone
[bke][2026-03-27 11:36:51][INFO] set ntp server
Fri Mar 27 11:36:51 AM CST 2026

[bke][2026-03-27 11:36:51][INFO] config local source
[bke][2026-03-27 11:36:51][INFO] 在线模式:domain:deploy.bocloud.k8s 绑定到默认IP:10.17.30.131
[bke][2026-03-27 11:36:51][WARN] 无法解析私有仓库地址,跳过CA证书配置
[bke][2026-03-27 11:36:51][INFO] Using client certificate authentication(CA only)
[bke][2026-03-27 11:36:51][INFO] Download source file...
[bke][2026-03-27 11:36:53][INFO] Writing data...
...
[bke][2026-03-27 13:48:19][INFO] containerd sandbox image: hub.oepkgs.net/openfuyao/pause:3.9
[bke][2026-03-27 13:48:19][INFO] Created offline special hosts.toml: /etc/containerd/certs.d/127.0.0.1:40443/hosts.toml
[bke][2026-03-27 13:48:19][INFO] wait for containerd to start
[bke][2026-03-27 13:48:19][INFO] Waiting for containerd to be ready
...
成功解压: /opt/cni/bin/vrf , 共处理了 4007138 个字符
[bke][2026-03-27 13:48:20][INFO] Start the base dependency service
[bke][2026-03-27 13:48:21][INFO] Image hub.oepkgs.net/openfuyao/registry:2.8.1 is downloading
[bke][2026-03-27 13:48:29][INFO] Wait for the container mirroring service to start...
[bke][2026-03-27 13:48:34][INFO] The container mirroring service is started by containerd. 
[bke][2026-03-27 13:48:34][INFO] Image hub.oepkgs.net/openfuyao/nginx:1.23.0-alpine is downloading
[bke][2026-03-27 13:48:43][INFO] Wait for the container yum service to start...
[bke][2026-03-27 13:48:48][INFO] The container yum service is started. 
[bke][2026-03-27 13:48:48][INFO] Image hub.oepkgs.net/openfuyao/helm/chartmuseum:v0.16.2 is downloading
[bke][2026-03-27 13:49:05][INFO] Wait for the chart mirroring service to start...
[bke][2026-03-27 13:49:10][INFO] The chart mirroring service is started. 
[bke][2026-03-27 13:49:10][INFO] Image hub.oepkgs.net/openfuyao/openebs/nfs-server-alpine:0.9.0 is downloading
[bke][2026-03-27 13:49:17][INFO] Wait for the nfs mirroring service to start...
[bke][2026-03-27 13:49:22][INFO] The nfs mirroring service is started. 
[bke][2026-03-27 13:49:22][INFO] Image hub.oepkgs.net/openfuyao/rancher/k3s:v1.25.16-k3s4 is downloading
2026-03-27T13:50:07.914+0800    info    k3s/k3s.go:314  params: onlineImage=cr.openfuyao.cn/openfuyao/bke-online-installed:latest otherRepo=, otherRepoIp=, hostIP=10.17.30.131, imageRepo=deploy.bocloud.k8s, imageRepoPort=40443, kubernetesPort=36443
[bke][2026-03-27 13:50:07][INFO] Start the local Kubernetes cluster...
[bke][2026-03-27 13:50:10][ERROR] Failed to copy kubectl from the container
[bke][2026-03-27 13:50:10][ERROR] Failed to start kubernetes exit status 1
[bke][2026-03-27 13:50:10][ERROR] Failed to start cluster API, exit status 1

The startup failed, and surprisingly the cause was that `tar` was not installed:

[root@stl-fuyao-2603 fuyao-26.3-rc3]# nerdctl cp kubernetes:/bin/k3s /tmp/test-k3s
FATA[0000] unable to copy: failed to find `tar` binary 

Installing `tar` resolved it, and the deployment completed successfully.
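`nerdctl cp` (like `kubectl cp`) shells out to a `tar` binary to stream files out of the container, so a minimal cloud image without tar breaks it. A pre-flight check worth running before `bkeadm init` (the package name `tar` is the standard one in openEuler's yum repos):

```shell
# bkeadm/nerdctl need tar to copy files out of containers;
# install it up front on minimal cloud images.
if ! command -v tar >/dev/null 2>&1; then
    yum install -y tar
fi
tar --version | head -n 1
```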

NPU Operator

NAMESPACE                     NAME                                                          READY   STATUS             RESTARTS        AGE
cluster-system                bkeagent-deployer-h2flc                                       1/1     Running            0               65m
ingress-nginx                 ingress-nginx-admission-create-xqwmf                          0/1     Completed          0               62m
ingress-nginx                 ingress-nginx-admission-patch-4dcjr                           0/1     Completed          1               62m
ingress-nginx                 ingress-nginx-controller-p5nrm                                1/1     Running            0               62m
kube-system                   ascend-device-plugin-xtfkb                                    0/1     CrashLoopBackOff   7 (4m34s ago)   16m
kube-system                   calico-kube-controllers-6d75d78f5d-tfjvf                      1/1     Running            0               65m
kube-system                   calico-node-xtd99                                             1/1     Running            0               65m
kube-system                   coredns-6c6fdbdb66-24cn4                                      1/1     Running            0               65m
kube-system                   coredns-6c6fdbdb66-w29mt                                      1/1     Running            1 (63m ago)     65m
kube-system                   etcd-master1                                                  1/1     Running            0               64m
kube-system                   kube-apiserver-master1                                        1/1     Running            0               55m
kube-system                   kube-controller-manager-master1                               1/1     Running            1 (55m ago)     65m
kube-system                   kube-proxy-xxgz4                                              1/1     Running            0               65m
kube-system                   kube-scheduler-master1                                        1/1     Running            1 (55m ago)     65m
kube-system                   metrics-server-586f979f47-4b6fz                               1/1     Running            0               54m
mindx-dl                      ascend-operator-manager-5d4d89f675-tbjpv                      0/1     Pending            0               15m
mindx-dl                      clusterd-5588c5dc88-2fd5g                                     1/1     Running            0               16m
mindx-dl                      resilience-controller-8686bbd76f-jv5lp                        1/1     Running            0               16m
monitoring                    alertmanager-main-0                                           2/2     Running            0               51m
monitoring                    alertmanager-main-1                                           2/2     Running            0               51m
monitoring                    alertmanager-main-2                                           2/2     Running            0               51m
monitoring                    blackbox-exporter-6d6fbbfc96-cl6sg                            3/3     Running            0               56m
monitoring                    kube-state-metrics-677558db89-9rtbl                           3/3     Running            0               56m
monitoring                    node-exporter-4mzzl                                           2/2     Running            0               56m
monitoring                    prometheus-k8s-0                                              2/2     Running            0               51m
monitoring                    prometheus-k8s-1                                              2/2     Running            0               51m
monitoring                    prometheus-operator-5cb64c846d-8m55t                          2/2     Running            0               56m
npu                           ascend-runtime-containerd-pg5q4                               1/1     Running            0               16m
npu                           npu-driver-66dqf                                              0/1     Init:0/1           0               103s
npu                           npu-feature-discovery-rr68r                                   1/1     Running            0               17m
npu                           npu-operator-5858d99c89-lgd24                                 1/1     Running            0               17m
npu                           npu-operator-node-feature-discovery-gc-5cf8bc768d-6w4b4       1/1     Running            0               17m
npu                           npu-operator-node-feature-discovery-master-5985b5cfcd-swdqs   1/1     Running            0               17m
npu                           npu-operator-node-feature-discovery-worker-q5ct6              1/1     Running            0               17m
openfuyao-system-controller   modify-manifests-master1-sjxrv                                0/1     Completed          0               55m
openfuyao-system-controller   openfuyao-system-controller-8444679b95-jv8jg                  1/1     Running            0               65m
openfuyao-system              application-management-service-75799d4dd6-8vm9w               2/2     Running            0               55m
openfuyao-system              console-service-84bbd85575-grtwx                              1/1     Running            0               55m
openfuyao-system              console-website-855c9d8f65-8btkl                              1/1     Running            0               61m
openfuyao-system              local-harbor-chartmuseum-7f96745849-d7vzk                     1/1     Running            0               56m
openfuyao-system              local-harbor-core-5f847798b8-khkf4                            1/1     Running            1 (51m ago)     56m
openfuyao-system              local-harbor-database-0                                       1/1     Running            0               56m
openfuyao-system              local-harbor-jobservice-7d67f4f887-svklq                      1/1     Running            3 (47m ago)     56m
openfuyao-system              local-harbor-nginx-6449749746-pbzkc                           1/1     Running            0               56m
openfuyao-system              local-harbor-portal-78bf65c9-rqmk6                            1/1     Running            0               56m
openfuyao-system              local-harbor-redis-0                                          1/1     Running            0               56m
openfuyao-system              local-harbor-registry-65884895bf-ff2wd                        2/2     Running            0               56m
openfuyao-system              marketplace-service-5c79cbcbfc-swg9r                          2/2     Running            0               55m
openfuyao-system              monitoring-service-79fc57c6b4-47vbm                           2/2     Running            0               56m
openfuyao-system              oauth-server-68b6655d95-c8pzx                                 1/1     Running            0               54m
openfuyao-system              oauth-webhook-6995d46758-g47xn                                1/1     Running            0               54m
openfuyao-system              plugin-management-service-84bfcd6565-2bmz2                    2/2     Running            0               54m
openfuyao-system              user-management-operator-8d79bd8b8-jr4sk                      1/1     Running            0               54m
openfuyao-system              web-terminal-service-6d858d974-hpjw6                          1/1     Running            0               54m
volcano-system                volcano-controllers-6ffb787f8d-mxxph                          1/1     Running            0               16m
volcano-system                volcano-scheduler-867f9784bb-vsswv                            0/1     Pending            0               16m
kube-system                   ascend-device-plugin-sbxr7                           0/1     CrashLoopBackOff   6 (100s ago)   9m57s
[root@master1 ~]# kubectl -n kube-system logs ascend-device-plugin-sbxr7  
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)  
[INFO]     2026/03/27 07:27:47.189809 1       hwlog/api.go:108    devicePlugin.log's logger init success  
[INFO]     2026/03/27 07:27:47.190348 1       main.go:187    ascend device plugin starting and the version is v6.0.0_linux-aarch64  
[INFO]     2026/03/27 07:27:47.190416 1       main.go:188    ascend device plugin starting scene is center  
2026/03/27 07:27:47 command exec failed, exit status 1  
[ERROR]    2026/03/27 07:27:47.192892 1       devmanager/devmanager.go:95    deviceManager init failed, prepare dcmi failed, err: cannot found valid driver lib, fromEnv: lib path is invalid, [], fromLdCmd: can't find valid lib: EOF  
[ERROR]    2026/03/27 07:27:47.192970 1       main.go:203    init devmanager failed, err: auto init failed, err: get chip info failed, err: device Manager is nil, may encounter an exception during initialization. You can check the system log to confirm

Driver installation failed:

[root@master1 ~]# kubectl -n npu logs -f npu-driver-lxv5c -c npu-driver-installer
Checking if /mnt/usr/local/sbin/npu-smi exists...
master1 is not an option, please use -h to view help
[2026-03-27 15:55:00, [INFO] No operation specified, default install operation on node: --
[2026-03-27 15:55:00, [INFO] install npu-driver
[2026-03-27 15:55:00, [INFO] copy npu-install to host
[2026-03-27 15:55:00, [INFO] copy npu-install to host success
[2026-03-27 15:55:00,303677070] [INFO] Install dependency packages
[2026-03-27 15:55:00,305026980] [INFO] Using yum for package installation
5 files removed
repo                                            4.3 MB/s |  33 kB     00:00    
Metadata cache created.
[2026-03-27 15:55:01,143913450] [INFO] jq is already installed, skipping...
[2026-03-27 15:55:01,150749780] [INFO] wget is already installed, skipping...
[2026-03-27 15:55:01,156894190] [INFO] Installing unzip...
No match for argument: unzip
Error: Unable to find a match: unzip
[2026-03-27 15:55:01,599984660] [FATAL] Failed to install unzip using yum
[2026-03-27 15:55:01, [FATAL] install failed: --
The install failed because the host was missing unzip, yet the installer container showed no abnormal state, so the error could not be noticed from outside.
From the pod listing, ascend-device-plugin-sbxr7 kept exiting because it could not find the driver, while the real cause was that the npu-driver-66dqf pod had failed to install the driver. Instead of exiting, that pod kept running as if nothing were wrong, obscuring the actual point of failure.
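One way to avoid masking the real failure would be for the installer container to exit non-zero whenever a [FATAL] line is logged, so kubelet surfaces the error instead of leaving the pod Running. A minimal sketch; the `run_installer` wrapper name and log handling are hypothetical, not taken from the actual installer:

```shell
# Hypothetical wrapper: capture the installer's output and exit non-zero
# when the command fails or reports [FATAL], so the container is marked failed.
run_installer() {
  log=$(mktemp)
  "$@" >"$log" 2>&1
  status=$?
  cat "$log"
  if [ "$status" -ne 0 ] || grep -q '\[FATAL\]' "$log"; then
    echo "installer failed, see log above" >&2
    rm -f "$log"
    return 1
  fi
  rm -f "$log"
  return 0
}
```

With this, a `[FATAL] Failed to install unzip using yum` line would put the pod into CrashLoopBackOff at the true point of failure, rather than leaving ascend-device-plugin to crash downstream.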

After installing unzip, it works:

[root@master1 ~]# kubectl -n npu logs -f npu-driver-66dqf -c npu-driver-installer
Checking if /mnt/usr/local/sbin/npu-smi exists...
master1 is not an option, please use -h to view help
[2026-03-27 15:57:58, [INFO] No operation specified, default install operation on node: --
[2026-03-27 15:57:58, [INFO] install npu-driver
[2026-03-27 15:57:58, [INFO] copy npu-install to host
[2026-03-27 15:57:58, [INFO] copy npu-install to host success
[2026-03-27 15:57:58,069791170] [INFO] Install dependency packages
[2026-03-27 15:57:58,070929430] [INFO] Using yum for package installation
48 files removed
repo                                            5.1 MB/s |  33 kB     00:00    
OS                                               67 MB/s | 3.3 MB     00:00    
everything                                       76 MB/s |  17 MB     00:00    
EPOL                                             65 MB/s | 4.7 MB     00:00    
debuginfo                                        69 MB/s | 3.9 MB     00:00    
source                                           59 MB/s | 1.8 MB     00:00    
update                                           79 MB/s |  71 MB     00:00 

This time a substantial amount of time is spent in the init phase, which is the expected behavior:

npu                           npu-driver-66dqf                                              0/1     Init:0/1           0             3m51s

Running in a virtual machine seems to trigger the following error:

kubectl  -n kube-system delete pod ascend-device-plugin-xtfkb

[Driver] [2026-03-27 16:03:40] [INFO]upgradePercentage:100%
[Driver] [2026-03-27 16:03:42] [INFO]Driver package installed successfully! The new version takes effect immediately.
[Driver] [2026-03-27 16:03:42] [INFO]End time: 2026-03-27 16:03:42
[Firmware] [2026-03-27 16:03:43] [INFO]Start time: 2026-03-27 16:03:43
[Firmware] [2026-03-27 16:03:43] [INFO]LogFile: /var/log/ascend_seclog/ascend_install.log
[Firmware] [2026-03-27 16:03:43] [INFO]OperationLogFile: /var/log/ascend_seclog/operation.log
[Firmware] [2026-03-27 16:03:43] [WARNING]Do not power off or restart the system during the installation/upgrade
[Firmware] [2026-03-27 16:03:43] [ERROR]Not a physical-machine, firmware upgrade does not support.
[Firmware] [2026-03-27 16:03:43] [INFO]End time: 2026-03-27 16:03:43 

But the driver itself has already been deployed:

[root@master1 ~]# npu-smi info  
+--------------------------------------------------------------------------------------------------------+  
| npu-smi 24.1.rc3                                 Version: 24.1.rc3                                     |  
+-------------------------------+-----------------+------------------------------------------------------+  
| NPU     Name                  | Health          | Power(W)     Temp(C)           Hugepages-Usage(page) |  
| Chip    Device                | Bus-Id          | AICore(%)    Memory-Usage(MB)                        |  
+===============================+=================+======================================================+  
| 176     310P3                 | OK              | NA           56                0     / 0             |  
| 0       0                     | 0000:00:16.0    | 0            1838 / 21527                            |  
+===============================+=================+======================================================+  
+-------------------------------+-----------------+------------------------------------------------------+  
| NPU     Chip                  | Process id      | Process name             | Process memory(MB)        |  
+===============================+=================+======================================================+  
| No running processes found in NPU 176                                                                  |  
+===============================+=================+======================================================+

However, the device-plugin container still fails to run:

[root@master1 ~]# kubectl -n kube-system logs -f -l name=ascend-device-plugin-ds -c device-plugin-01
[INFO]     2026/03/27 08:19:41.677826 1       hwlog/api.go:108    devicePlugin.log's logger init success
[INFO]     2026/03/27 08:19:41.678051 1       main.go:187    ascend device plugin starting and the version is v6.0.0_linux-aarch64
[INFO]     2026/03/27 08:19:41.678116 1       main.go:188    ascend device plugin starting scene is center
[INFO]     2026/03/27 08:19:41.900653 1       devmanager/devmanager.go:104    the dcmi version is 24.1.rc3
[ERROR]    2026/03/27 08:19:41.900744 1       devmanager/devmanager.go:211    get error card quantity: 0
[ERROR]    2026/03/27 08:19:41.900780 1       devmanager/devmanager.go:195    get card list failed for init
[ERROR]    2026/03/27 08:19:41.900828 1       main.go:203    init devmanager failed, err: auto init failed, err: get card list failed for init

Firewall issue after deployment

Another finding: whether it comes from the openEuler defaults or from the openFuyao default configuration, the iptables FORWARD chain policy is DROP, which leaves services unreachable after deployment.

[root@stl-fuyao-2603 ~]# iptables -L -n
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy DROP)
target     prot opt source               destination         
CNI-ISOLATION-STAGE-1  all  --  0.0.0.0/0            0.0.0.0/0            /* CNI firewall plugin rules (ingressPolicy: same-bridge) */
CNI-FORWARD  all  --  0.0.0.0/0            0.0.0.0/0            /* CNI firewall plugin rules */

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

Fix it with:

iptables -P FORWARD ACCEPT
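Note that `iptables -P` does not persist across reboots, so the policy needs to be reapplied at boot or saved with the distribution's persistence mechanism. To make the check scriptable, the policy can be parsed out of the `iptables -L` header; a small sketch, assuming the English header format shown above:

```shell
# Parse the chain policy out of an `iptables -L <chain> -n` header line,
# e.g. "Chain FORWARD (policy DROP)" -> "DROP"
parse_policy() {
  sed -n 's/^Chain [A-Z]* (policy \([A-Z]*\)).*/\1/p'
}

# Apply the fix only when needed (requires root):
# if [ "$(iptables -L FORWARD -n | parse_policy)" = "DROP" ]; then
#   iptables -P FORWARD ACCEPT
# fi
```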

Package repository issue during deployment

[root@master1 ~]# cp /etc/yum.repos.d/b
bak/      bke.repo

The deployment removes the system's stock repositories by default, so no packages can be installed afterwards and the repos have to be reconfigured by hand.
Could this be handled more flexibly, for example by assigning repository priorities, instead of removing the repositories outright?
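For instance, dnf supports a per-repository `priority` option (lower value wins; the default is 99), so the deployment repo could outrank the system repos without deleting them. A sketch, where the repo id, path, and baseurl are made-up placeholders:

```shell
# Hypothetical repo file: the deployment repo takes precedence over the system
# repos (priority 10 < default 99) instead of the system repos being removed.
cat > /tmp/openfuyao-deploy.repo << 'EOF'
[openfuyao-deploy]
name=openFuyao deployment repo (placeholder)
baseurl=file:///opt/openfuyao/repo
enabled=1
gpgcheck=0
priority=10
EOF
```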

Extraction tool check

The rclone install script is a good reference for how to handle this:

root@hosthatch-us1:~# sudo -v ; curl https://rclone.org/install.sh | sudo bash
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current               
                                 Dload  Upload   Total   Spent    Left  Speed
100  4734  100  4734    0     0   8458      0 --:--:-- --:--:-- --:--:--  8468
                                                                                              
None of the supported tools for extracting zip archives (unzip 7z busybox) were found. Please install one of them and try again.

It fails with a clear error message.
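The npu-driver installer could apply the same pattern: probe for the required tools up front and abort with an explicit message before doing any work. A minimal sketch:

```shell
# Print the first available extraction tool, or fail with a clear message.
check_extract_tool() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      return 0
    fi
  done
  echo "ERROR: none of the supported extraction tools ($*) were found." >&2
  echo "Please install one of them and try again." >&2
  return 1
}
```

For example, calling `check_extract_tool unzip 7z busybox` at the start of the install would have surfaced the missing-unzip problem immediately.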

openFuyao InferNex AI Inference Integrated Deployment on 310P (300I Pro): Environment Issues and Fixes

Author: tl.s
April 13, 2026, 16:53

AI Inference Integrated Deployment (InferNex) is an end-to-end integrated deployment solution designed to optimize AI inference services in cloud-native environments. Built on the Kubernetes Gateway API Inference Extension (GIE) and the mainstream LLM stack, it uses a Helm chart to seamlessly integrate the core acceleration modules: an open-source gateway, intelligent routing, a high-performance inference backend, global KVCache management, a scaling-decision framework, and an inference observability system. It provides a complete accelerated path from request ingress through dynamic routing and inference execution to resource management and monitoring, aiming to raise inference throughput, reduce TTFT/TPOT latency, and deliver a one-stop, efficient AI service deployment experience.

Related documentation:

The official validation only covers the 910, and I have just a single 310P on hand. In theory it can run, but a series of modifications is required; this post records the problems hit during deployment and how they were resolved.

After deployment, several pods never came up:

NAMESPACE                     NAME                                                          READY   STATUS      RESTARTS       AGE  
ai-inference                  vllm-pd-2p1d-01-decode-54cc4c7579-5h62w                       0/1     Pending     0              5d18h  
ai-inference                  vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg                       0/1     Init:0/2    0              5d  
ai-inference                  vllm-pd-2p1d-01-prefill-5c546dbcc-thmkd                       0/1     Pending     0              5d18h  
ai-inference                  vllm-pd-2p1d-01-prefill-fd68f87cf-jjdlc                       0/1     Pending     0              5d  

hccn issue

It appears that hccn_tool cannot be found:

[root@master1 ~]# kubectl  -n ai-inference describe pod vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg 
Name:             vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg
Namespace:        ai-inference
Priority:         0
Service Account:  default
Node:             master1/10.17.30.131
Start Time:       Tue, 07 Apr 2026 09:31:33 +0800
Labels:           app.kubernetes.io/instance=infernex-vllm-pd-2p1d-01
                  app.kubernetes.io/name=inference-backend
                  openfuyao.com/dpSize=1
                  openfuyao.com/engine=vllm
                  openfuyao.com/model=qwen-qwen3-8b
                  openfuyao.com/pdGroupID=qwen3-8b-pd-01
                  openfuyao.com/pdRole=decode
                  openfuyao.com/ppSize=1
                  openfuyao.com/tpSize=1
                  pod-template-hash=6cd64bc69c
Annotations:      checksum/config: 476b32f01fc96ff2896aee7fce288cd2b58cdb2ac825d1a22518798806847a2c
                  huawei.com/AscendReal: Ascend310P-0
                  huawei.com/kltDev: Ascend310P-0
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    ReplicaSet/vllm-pd-2p1d-01-decode-6cd64bc69c
Init Containers:
  mooncake-config-init:
    Container ID:  
    Image:         hub.oepkgs.net/openfuyao/mikefarah/yq:4.50.1
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      set -e
      CONFIG_PATH="/app/mooncake.json"
      mkdir -p "$(dirname "$CONFIG_PATH")"
      cat > /tmp/mooncake_config.tpl << 'EOF'
        local_hostname: "$POD_IP"
        metadata_server: "redis://redis-service:6379"
        master_server_address: "mooncake-master-service:30089"
        device_name: ""
        protocol: "ascend"
        global_segment_size: 42949672960
        use_ascend_direct: true
        
      EOF
      POD_IP_VALUE="${POD_IP:-0.0.0.0}"
      sed "s/\$POD_IP/${POD_IP_VALUE}/g" /tmp/mooncake_config.tpl | yq eval - -o=json > "$CONFIG_PATH"
      
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      POD_NAME:  vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg (v1:metadata.name)
      POD_IP:     (v1:status.podIP)
    Mounts:
      /app from mooncake-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jswfw (ro)
  huggingface-download:
    Container ID:  
    Image:         cr.openfuyao.cn/openfuyao/huggingface-download:0.22.2
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      hf
      download
      Qwen/Qwen3-8B
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      HF_HUB_OFFLINE:        0
      VLLM_USE_V1:           1
      GLOO_SOCKET_IFNAME:    eth0
      TP_SOCKET_IFNAME:      eth0
      HCCL_SOCKET_IFNAME:    eth0
      MOONCAKE_CONFIG_PATH:  /app/mooncake.json
    Mounts:
      /root/.cache from rootcache (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jswfw (ro)
Containers:
  decode-engine:
    Container ID:  
    Image:         hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0
    Image ID:      
    Port:          8000/TCP (decode-port)
    Host Port:     0/TCP (decode-port)
    Command:
      /bin/bash
      -c
    Args:
      # PHYSICAL_DEVICES stands for the physical devices assigned to the container, use for vllm ascend 0.10.x
      export PHYSICAL_DEVICES=$(ls /dev/davinci* 2>/dev/null | grep -o '[0-9]\+' | sort -n | paste -sd',' -)
      
      # start vllm service
      vllm serve Qwen/Qwen3-8B \
        --served-model-name Qwen/Qwen3-8B \
        --trust-remote-code \
        --no-enable-prefix-caching \
        --port 8000 \
        --tensor-parallel-size 1 \
        --max-model-len 10000 \
        --max-num-batched-tokens 40960 \
        --data-parallel-size 1 \
        --pipeline-parallel-size 1 \
        --gpu-memory-utilization 0.8 \
        --kv-transfer-config '{"engine_id":"'$POD_NAME'","kv_connector":"MultiConnector","kv_connector_extra_config":{"connectors":[{"kv_buffer_device":"npu","kv_connector":"MooncakeConnectorV1","kv_connector_extra_config":{"decode":{"dp_size":1,"tp_size":1},"prefill":{"dp_size":1,"tp_size":2},"use_ascend_direct":true},"kv_parallel_size":1,"kv_port":"20001","kv_role":"kv_consumer"},{"kv_buffer_device":"npu","kv_connector":"AscendStoreConnector","kv_connector_extra_config":{"backend":"mooncake","decode":{"dp_size":1,"tp_size":1},"lookup_rpc_port":"0","prefill":{"dp_size":1,"tp_size":2}},"kv_parallel_size":1,"kv_port":"20001","kv_role":"kv_consumer"}],"decode":{"dp_size":1,"tp_size":1},"prefill":{"dp_size":1,"tp_size":2}},"kv_port":"20001","kv_rank":1,"kv_role":"kv_consumer"}'
      
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:                    8
      huawei.com/Ascend310P:  1
      memory:                 64Gi
    Requests:
      cpu:                    4
      huawei.com/Ascend310P:  1
      memory:                 32Gi
    Liveness:                 http-get http://:decode-port/health delay=0s timeout=10s period=10s #success=1 #failure=3
    Readiness:                http-get http://:decode-port/v1/models delay=0s timeout=5s period=10s #success=1 #failure=3
    Startup:                  http-get http://:decode-port/v1/models delay=30s timeout=5s period=30s #success=1 #failure=60
    Environment:
      HF_HUB_OFFLINE:        0
      VLLM_USE_V1:           1
      GLOO_SOCKET_IFNAME:    eth0
      TP_SOCKET_IFNAME:      eth0
      HCCL_SOCKET_IFNAME:    eth0
      MOONCAKE_CONFIG_PATH:  /app/mooncake.json
      POD_NAME:              vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg (v1:metadata.name)
      POD_IP:                 (v1:status.podIP)
    Mounts:
      /app from mooncake-config (ro)
      /dev/shm from shm (rw)
      /etc/ascend_install.info from installinfo (rw)
      /etc/hccn.conf from hccnconf (rw)
      /root/.cache from rootcache (rw)
      /usr/bin/hccn_tool from hccntool (rw)
      /usr/local/Ascend/driver/lib64 from lib64 (rw)
      /usr/local/Ascend/driver/version.info from version (rw)
      /usr/local/bin/npu-smi from npusmi (rw)
      /usr/local/dcmi from dcmi (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jswfw (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  mooncake-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  shm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  24Gi
  dcmi:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/dcmi
    HostPathType:  
  npusmi:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/bin/npu-smi
    HostPathType:  File
  lib64:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/Ascend/driver/lib64
    HostPathType:  
  version:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/Ascend/driver/version.info
    HostPathType:  File
  installinfo:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ascend_install.info
    HostPathType:  File
  hccntool:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/bin/hccn_tool
    HostPathType:  File
  hccnconf:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/hccn.conf
    HostPathType:  File
  rootcache:
    Type:          HostPath (bare host directory volume)
    Path:          /home/llm_cache
    HostPathType:  
  kube-api-access-jswfw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 30s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 30s
Events:
  Type     Reason            Age                     From               Message
  ----     ------            ----                    ----               -------
  Warning  FailedScheduling  5d                      default-scheduler  0/1 nodes are available: 1 Insufficient memory. no new claims to deallocate, preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Warning  FailedScheduling  5d (x2 over 5d)         default-scheduler  0/1 nodes are available: 1 Insufficient memory. no new claims to deallocate, preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Warning  FailedScheduling  9m32s                   default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint(s). no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  9m20s (x24 over 9m29s)  default-scheduler  0/1 nodes are available: 1 Insufficient huawei.com/Ascend310P. no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Normal   Scheduled         8m58s                   default-scheduler  Successfully assigned ai-inference/vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg to master1
  Warning  FailedMount       44s (x12 over 8m58s)    kubelet            MountVolume.SetUp failed for volume "hccntool" : hostPath type check failed: /usr/bin/hccn_tool is not a file

Temporary workaround:

touch /usr/bin/hccn_tool
chmod +x /usr/bin/hccn_tool
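If several hostPath mounts of type File are affected (the Pod spec above also mounts /etc/hccn.conf the same way), the stub trick can be wrapped in a small helper. A sketch; whether empty stubs are actually sufficient depends on whether the workload ever invokes hccn_tool at runtime:

```shell
# Create an empty executable stub so a hostPath mount with type: File succeeds
# on a node (e.g. a 310P without hccn_tool) where the real file is absent.
make_stub() {
  f="$1"
  if [ ! -e "$f" ]; then
    touch "$f"
    chmod +x "$f"
  fi
}

# make_stub /usr/bin/hccn_tool
# make_stub /etc/hccn.conf
```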

huggingface-download failure

[root@master1 ~]# kubectl  -n ai-inference describe pod vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp 
Name:             vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp
Namespace:        ai-inference
Priority:         0
Service Account:  default
Node:             master1/10.17.30.131
Start Time:       Tue, 07 Apr 2026 09:45:19 +0800
Labels:           app.kubernetes.io/instance=infernex-vllm-pd-2p1d-01
                  app.kubernetes.io/name=inference-backend
                  openfuyao.com/dpSize=1
                  openfuyao.com/engine=vllm
                  openfuyao.com/model=qwen-qwen3-8b
                  openfuyao.com/pdGroupID=qwen3-8b-pd-01
                  openfuyao.com/pdRole=decode
                  openfuyao.com/ppSize=1
                  openfuyao.com/tpSize=1
                  pod-template-hash=6cd64bc69c
Annotations:      checksum/config: 476b32f01fc96ff2896aee7fce288cd2b58cdb2ac825d1a22518798806847a2c
                  cni.projectcalico.org/containerID: 32b3384131b69054ca45acc8afe5e272b0ca681ea6d0611b3fec7316e3532e80
                  cni.projectcalico.org/podIP: 192.168.137.155/32
                  cni.projectcalico.org/podIPs: 192.168.137.155/32
                  huawei.com/AscendReal: Ascend310P-0
                  huawei.com/kltDev: Ascend310P-0
Status:           Pending
IP:               192.168.137.155
IPs:
  IP:           192.168.137.155
Controlled By:  ReplicaSet/vllm-pd-2p1d-01-decode-6cd64bc69c
Init Containers:
  mooncake-config-init:
    Container ID:  containerd://4ede488dd17e33f3a980aee6fa4eac3093ad6366c3854fe85436f81f6e1df7bb
    Image:         hub.oepkgs.net/openfuyao/mikefarah/yq:4.50.1
    Image ID:      hub.oepkgs.net/openfuyao/mikefarah/yq@sha256:4facc66fdcc785ec961ef7f2185f53f862f462eefe1d50c2eb311c2bb26823e3
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      set -e
      CONFIG_PATH="/app/mooncake.json"
      mkdir -p "$(dirname "$CONFIG_PATH")"
      cat > /tmp/mooncake_config.tpl << 'EOF'
        local_hostname: "$POD_IP"
        metadata_server: "redis://redis-service:6379"
        master_server_address: "mooncake-master-service:30089"
        device_name: ""
        protocol: "ascend"
        global_segment_size: 42949672960
        use_ascend_direct: true
        
      EOF
      POD_IP_VALUE="${POD_IP:-0.0.0.0}"
      sed "s/\$POD_IP/${POD_IP_VALUE}/g" /tmp/mooncake_config.tpl | yq eval - -o=json > "$CONFIG_PATH"
      
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 07 Apr 2026 09:45:20 +0800
      Finished:     Tue, 07 Apr 2026 09:45:20 +0800
    Ready:          True
    Restart Count:  0
    Environment:
      POD_NAME:  vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp (v1:metadata.name)
      POD_IP:     (v1:status.podIP)
    Mounts:
      /app from mooncake-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9jvbd (ro)
  huggingface-download:
    Container ID:  containerd://84e15b7d9e2f7382181e309c1558174ec48e58ad0ae14f92ae0dfff284da76e5
    Image:         cr.openfuyao.cn/openfuyao/huggingface-download:0.22.2
    Image ID:      cr.openfuyao.cn/openfuyao/huggingface-download@sha256:ac86348b5e6934a020c21c4f0ebf81b520194ba8e549f1847ecc7521b82d9a8d
    Port:          <none>
    Host Port:     <none>
    Command:
      hf
      download
      Qwen/Qwen3-8B
    State:          Running
      Started:      Tue, 07 Apr 2026 09:47:33 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 07 Apr 2026 09:45:20 +0800
      Finished:     Tue, 07 Apr 2026 09:47:32 +0800
    Ready:          False
    Restart Count:  1
    Environment:
      HF_HUB_OFFLINE:        0
      VLLM_USE_V1:           1
      GLOO_SOCKET_IFNAME:    eth0
      TP_SOCKET_IFNAME:      eth0
      HCCL_SOCKET_IFNAME:    eth0
      MOONCAKE_CONFIG_PATH:  /app/mooncake.json
    Mounts:
      /root/.cache from rootcache (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9jvbd (ro)
Containers:
  decode-engine:
    Container ID:  
    Image:         hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0
    Image ID:      
    Port:          8000/TCP (decode-port)
    Host Port:     0/TCP (decode-port)
    Command:
      /bin/bash
      -c
    Args:
      # PHYSICAL_DEVICES stands for the physical devices assigned to the container, use for vllm ascend 0.10.x
      export PHYSICAL_DEVICES=$(ls /dev/davinci* 2>/dev/null | grep -o '[0-9]\+' | sort -n | paste -sd',' -)
      
      # start vllm service
      vllm serve Qwen/Qwen3-8B \
        --served-model-name Qwen/Qwen3-8B \
        --trust-remote-code \
        --no-enable-prefix-caching \
        --port 8000 \
        --tensor-parallel-size 1 \
        --max-model-len 10000 \
        --max-num-batched-tokens 40960 \
        --data-parallel-size 1 \
        --pipeline-parallel-size 1 \
        --gpu-memory-utilization 0.8 \
        --kv-transfer-config '{"engine_id":"'$POD_NAME'","kv_connector":"MultiConnector","kv_connector_extra_config":{"connectors":[{"kv_buffer_device":"npu","kv_connector":"MooncakeConnectorV1","kv_connector_extra_config":{"decode":{"dp_size":1,"tp_size":1},"prefill":{"dp_size":1,"tp_size":2},"use_ascend_direct":true},"kv_parallel_size":1,"kv_port":"20001","kv_role":"kv_consumer"},{"kv_buffer_device":"npu","kv_connector":"AscendStoreConnector","kv_connector_extra_config":{"backend":"mooncake","decode":{"dp_size":1,"tp_size":1},"lookup_rpc_port":"0","prefill":{"dp_size":1,"tp_size":2}},"kv_parallel_size":1,"kv_port":"20001","kv_role":"kv_consumer"}],"decode":{"dp_size":1,"tp_size":1},"prefill":{"dp_size":1,"tp_size":2}},"kv_port":"20001","kv_rank":1,"kv_role":"kv_consumer"}'
      
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:                    8
      huawei.com/Ascend310P:  1
      memory:                 64Gi
    Requests:
      cpu:                    4
      huawei.com/Ascend310P:  1
      memory:                 32Gi
    Liveness:                 http-get http://:decode-port/health delay=0s timeout=10s period=10s #success=1 #failure=3
    Readiness:                http-get http://:decode-port/v1/models delay=0s timeout=5s period=10s #success=1 #failure=3
    Startup:                  http-get http://:decode-port/v1/models delay=30s timeout=5s period=30s #success=1 #failure=60
    Environment:
      HF_HUB_OFFLINE:        0
      VLLM_USE_V1:           1
      GLOO_SOCKET_IFNAME:    eth0
      TP_SOCKET_IFNAME:      eth0
      HCCL_SOCKET_IFNAME:    eth0
      MOONCAKE_CONFIG_PATH:  /app/mooncake.json
      POD_NAME:              vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp (v1:metadata.name)
      POD_IP:                 (v1:status.podIP)
    Mounts:
      /app from mooncake-config (ro)
      /dev/shm from shm (rw)
      /etc/ascend_install.info from installinfo (rw)
      /etc/hccn.conf from hccnconf (rw)
      /root/.cache from rootcache (rw)
      /usr/bin/hccn_tool from hccntool (rw)
      /usr/local/Ascend/driver/lib64 from lib64 (rw)
      /usr/local/Ascend/driver/version.info from version (rw)
      /usr/local/bin/npu-smi from npusmi (rw)
      /usr/local/dcmi from dcmi (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9jvbd (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  mooncake-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  shm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  24Gi
  dcmi:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/dcmi
    HostPathType:  
  npusmi:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/bin/npu-smi
    HostPathType:  File
  lib64:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/Ascend/driver/lib64
    HostPathType:  
  version:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/Ascend/driver/version.info
    HostPathType:  File
  installinfo:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ascend_install.info
    HostPathType:  File
  hccntool:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/bin/hccn_tool
    HostPathType:  File
  hccnconf:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/hccn.conf
    HostPathType:  File
  rootcache:
    Type:          HostPath (bare host directory volume)
    Path:          /home/llm_cache
    HostPathType:  
  kube-api-access-9jvbd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 30s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 30s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  3m20s               default-scheduler  0/1 nodes are available: 1 Insufficient huawei.com/Ascend310P. no new claims to deallocate, preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Normal   Scheduled         2m15s               default-scheduler  Successfully assigned ai-inference/vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp to master1
  Normal   Pulled            2m14s               kubelet            Container image "hub.oepkgs.net/openfuyao/mikefarah/yq:4.50.1" already present on machine
  Normal   Created           2m14s               kubelet            Created container: mooncake-config-init
  Normal   Started           2m14s               kubelet            Started container mooncake-config-init
  Normal   Pulled            1s (x2 over 2m14s)  kubelet            Container image "cr.openfuyao.cn/openfuyao/huggingface-download:0.22.2" already present on machine
  Normal   Created           1s (x2 over 2m14s)  kubelet            Created container: huggingface-download
  Normal   Started           1s (x2 over 2m14s)  kubelet            Started container huggingface-download

Check the error logs:

# logs from the current attempt
kubectl -n ai-inference logs vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp -c huggingface-download

# logs from the previous, failed attempt
kubectl -n ai-inference logs vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp -c huggingface-download --previous

[root@master1 ~]# kubectl -n ai-inference logs vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp -c huggingface-download
[root@master1 ~]# 
[root@master1 ~]# kubectl -n ai-inference logs vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp -c huggingface-download --previous
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
    yield
  File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 250, in handle_request
    resp = self._pool.handle_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection_pool.py", line 256, in handle_request
    raise exc from None
  File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection_pool.py", line 236, in handle_request
    response = connection.handle_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection.py", line 101, in handle_request
    raise exc
  File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection.py", line 78, in handle_request
    stream = self._connect(request)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection.py", line 124, in _connect
    stream = self._network_backend.connect_tcp(**kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_backends/sync.py", line 207, in connect_tcp
    with map_exceptions(exc_map):
  File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ConnectError: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/_snapshot_download.py", line 240, in snapshot_download
    repo_info = api.repo_info(repo_id=repo_id, repo_type=repo_type, revision=revision)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 89, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 3285, in repo_info
    return method(
           ^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 89, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 3020, in model_info
    r = get_session().get(path, headers=headers, timeout=timeout, params=params)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1053, in get
    return self.request(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 825, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 914, in send
    response = self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 942, in _send_handling_auth
    response = self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 979, in _send_handling_redirects
    response = self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1014, in _send_single_request
    response = transport.handle_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 249, in handle_request
    with map_httpcore_exceptions():
  File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 118, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ConnectError: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/hf", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/cli/hf.py", line 113, in main
    app()
  File "/usr/local/lib/python3.11/site-packages/typer/main.py", line 1152, in __call__
    raise e
  File "/usr/local/lib/python3.11/site-packages/typer/main.py", line 1135, in __call__
    return get_command(self)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1485, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/typer/core.py", line 795, in main
    return _main(
           ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/typer/core.py", line 188, in _main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1873, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1269, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 824, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/typer/main.py", line 1514, in wrapper
    return callback(**use_params)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/cli/download.py", line 224, in download
    _print_result(run_download())
                  ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/cli/download.py", line 185, in run_download
    return snapshot_download(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 89, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/_snapshot_download.py", line 324, in snapshot_download
    raise LocalEntryNotFoundError(
huggingface_hub.errors.LocalEntryNotFoundError: Got: ConnectError: [Errno 101] Network is unreachable
An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.

This confirms a network problem: the node cannot reach Hugging Face (Network is unreachable), and there is no local cache to fall back on.
Solution: switch to a domestic (in-China) mirror (recommended).
Add an environment variable to the huggingface-download init container in the Deployment:

env:
  - name: HF_ENDPOINT
    value: "https://hf-mirror.com"

Add the same variable to the decode-engine container as well, or it will fail with the same error.
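If you prefer patching the live Deployment instead of editing the chart values, a JSON patch along these lines should work. This is a sketch: the init-container index (1 for huggingface-download here) is an assumption taken from the `kubectl describe` output, and the `/env/-` path requires that the container already has an `env` list; verify both against `kubectl get deploy -o json` first.

```python
import json

def hf_endpoint_patch(container_index: int,
                      mirror: str = "https://hf-mirror.com") -> str:
    """Build a JSON-Patch that appends HF_ENDPOINT to an init container's env.

    NOTE: container_index and the pre-existing `env` list are assumptions;
    if `env` is absent, add the whole list in one "add" op instead.
    """
    patch = [{
        "op": "add",
        "path": f"/spec/template/spec/initContainers/{container_index}/env/-",
        "value": {"name": "HF_ENDPOINT", "value": mirror},
    }]
    return json.dumps(patch)

# Print the command to run against the decode Deployment.
print("kubectl -n ai-inference patch deployment vllm-pd-2p1d-01-decode "
      f"--type json -p '{hf_endpoint_patch(1)}'")
```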

Tailing the download logs may show no output at all:

kubectl -n ai-inference logs deployments/vllm-pd-2p1d-01-decode huggingface-download  -f

In that case, watch the model directory size instead; it grows continuously while the download is running:

$ watch -n 2 -d 'du -sh /home/llm_cache/'
426M    /home/llm_cache/
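The same progress check can be scripted. This stdlib sketch totals regular-file sizes under the cache directory, similar to `du -sb` (the `/home/llm_cache` path is the one used above):

```python
import os

def dir_size_bytes(root: str) -> int:
    """Recursively sum regular-file sizes under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total

if __name__ == "__main__":
    # Prints 0.0 MiB if the directory does not exist yet.
    print(f"{dir_size_bytes('/home/llm_cache') / 2**20:.1f} MiB")
```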

Runtime error on 310P

[root@master1 ~]# kubectl  -n ai-inference logs  vllm-pd-2p1d-01-decode-7d487c49cd-qw89v
Defaulted container "decode-engine" out of: decode-engine, mooncake-config-init (init), huggingface-download (init)
...
INFO 04-07 05:15:19 [__init__.py:217] Platform plugin ascend is activated
(EngineCore_DP0 pid=94) INFO 04-07 05:15:33 [ascend_config.py:55] Linear layer sharding enabled with config: None. Note: This feature works optimally with FLASHCOMM2 and DSA-CP enabled; using it without these features may result in significant performance degradation.
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] EngineCore failed to start.
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] Traceback (most recent call last):
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/platform/patch_core.py", line 59, in run_engine_core
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     super().__init__(
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     self._init_executor()
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     self.driver_worker.init_worker(all_kwargs=[kwargs])
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm/vllm/v1/worker/worker_base.py", line 313, in init_worker
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     self.worker = worker_class(**kwargs)
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]                   ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker.py", line 116, in __init__
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     check_ascend_device_type()
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm-ascend/vllm_ascend/utils.py", line 708, in check_ascend_device_type
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     assert _ascend_device_type == cur_device_type, f"Current device type: {cur_device_type} does not match the installed version's device type: {_ascend_device_type}, please check your installation package."
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] AssertionError: Current device type: AscendDeviceType._310P does not match the installed version's device type: AscendDeviceType.A2, please check your installation package.
(EngineCore_DP0 pid=94) Process EngineCore_DP0:
(EngineCore_DP0 pid=94) Traceback (most recent call last):
(EngineCore_DP0 pid=94)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=94)     self.run()
(EngineCore_DP0 pid=94)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=94)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/platform/patch_core.py", line 72, in run_engine_core
(EngineCore_DP0 pid=94)     raise e
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/platform/patch_core.py", line 59, in run_engine_core
(EngineCore_DP0 pid=94)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=94)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=94)     super().__init__(
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=94)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=94)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=94)     self._init_executor()
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_DP0 pid=94)     self.driver_worker.init_worker(all_kwargs=[kwargs])
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm/vllm/v1/worker/worker_base.py", line 313, in init_worker
(EngineCore_DP0 pid=94)     self.worker = worker_class(**kwargs)
(EngineCore_DP0 pid=94)                   ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker.py", line 116, in __init__
(EngineCore_DP0 pid=94)     check_ascend_device_type()
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm-ascend/vllm_ascend/utils.py", line 708, in check_ascend_device_type
(EngineCore_DP0 pid=94)     assert _ascend_device_type == cur_device_type, f"Current device type: {cur_device_type} does not match the installed version's device type: {_ascend_device_type}, please check your installation package."
(EngineCore_DP0 pid=94)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) AssertionError: Current device type: AscendDeviceType._310P does not match the installed version's device type: AscendDeviceType.A2, please check your installation package.

Investigation showed that the hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0 image is built for the 910 series; the official documentation confirms this.

For 310P, an image with the 310p suffix is required. After searching the image registry, we switched to quay.io/ascend/vllm-ascend:v0.18.0rc1-310p-openeuler to try.

Comparing sha256 digests confirms that hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0 is an exact mirror of quay.io/ascend/vllm-ascend:v0.13.0; the digests are identical.
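The digest comparison can be done without pulling either image, for example by inspecting both with `skopeo inspect docker://...` and comparing the results. A sketch of the comparison step, assuming skopeo-style JSON output with a top-level `Digest` field:

```python
import json

def same_image(inspect_a: str, inspect_b: str) -> bool:
    """Compare two `skopeo inspect` JSON documents by manifest digest.

    Assumes the skopeo output shape (top-level "Digest" key); adjust
    if your tooling reports the digest under a different field.
    """
    return json.loads(inspect_a)["Digest"] == json.loads(inspect_b)["Digest"]
```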

vllm-ascend error on 310P

However, after switching to an image with the 310p suffix, startup fails with:

Every 1.0s: kubectl -n ai-inference get pod                                                            master1: Thu Apr  9 08:36:06 2026

NAME                                          READY   STATUS             RESTARTS        AGE
cache-indexer-deployment-65d5b449f6-x9l46     1/1     Running            0               17h
inference-gateway-istio-5f9b7d78f6-7kbrw      1/1     Running            26 (15h ago)    17h
infernex-epp-5cc456bd-4vvmv                   1/1     Running            0               17h
mooncake-master-deployment-74cc5666b7-fr4fq   1/1     Running            0               17h
redis-server-deployment-67566b9765-m66lc      1/1     Running            0               17h
vllm-pd-2p1d-01-decode-7687ccb7b-vg98n        0/1     CrashLoopBackOff   161 (31s ago)   16h
vllm-pd-2p1d-01-prefill-66f7564d7f-tdd84      0/1     Pending            0               43h
vllm-pd-2p1d-01-prefill-fd68f87cf-mhtdk       0/1     Pending            0               40h
vllm-pd-2p1d-01-proxy-7ff4f59865-h8xbw        1/1     Running            0               17h




(APIServer pid=1)   File "/vllm-workspace/vllm-ascend/vllm_ascend/distributed/kv_transfer/ascend_multi_connector.py", line 5, in <module>
(APIServer pid=1)     from vllm_ascend.distributed.kv_transfer.kv_p2p.mooncake_layerwise_connector import MooncakeLayerwiseConnector
(APIServer pid=1)   File "/vllm-workspace/vllm-ascend/vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py", line 25, in <module>
(APIServer pid=1)     from mooncake.engine import TransferEngine  # type: ignore
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ModuleNotFoundError: No module named 'mooncake'
(APIServer pid=1) [ERROR] 2026-04-08-08:58:03 (PID:1, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
(APIServer pid=1) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
[root@master1 ~]# kubectl  -n ai-inference describe pod vllm-pd-2p1d-01-decode-7687ccb7b-vg98n | grep -i image:
    Image:         hub.oepkgs.net/openfuyao/mikefarah/yq:4.50.1
    Image:         cr.openfuyao.cn/openfuyao/huggingface-download:0.22.2
    Image:         quay.io/ascend/vllm-ascend:main-310p

openFuyao's response to this issue:

@tl.s Troubleshooting the InferNex deployment on a 310P environment:

The vllm-ascend 310P image does not include mooncake, so KV-cache transfer between prefill and decode is not supported:
https://github.com/vllm-project/vllm-ascend/blob/main/Dockerfile.310p

We recommend deploying in aggregated mode instead; see the InferNex aggregated-mode example. Delete the `inference-backend.services[0].kvTransferConfig` setting at deploy time to disable the mooncake-related features:
https://gitcode.com/openFuyao/InferNex/blob/0.22.2/examples/vllm-aggregated-random-values.yaml

vllm-ascend documentation for online inference on 310P:
https://docs.vllm.ai/projects/ascend/en/latest/tutorials/hardwares/310p.html#online-inference-on-npu

Inference on 310P has not been verified yet; try v0.13.0 or v0.18.0rc1, both of which are covered by the official vLLM docs.
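Removing the setting can also be scripted against the chart values. A minimal sketch, assuming the values file has already been loaded into a dict (e.g. via PyYAML, not shown) with the `inference-backend.services` layout named above:

```python
def drop_kv_transfer(values: dict) -> dict:
    """Remove kvTransferConfig from every inference-backend service, if present."""
    services = values.get("inference-backend", {}).get("services", [])
    for svc in services:
        svc.pop("kvTransferConfig", None)
    return values
```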

Wrong card count in aggregated mode

The environment has only one 310P card, but the defaults request two. Two changes are needed: set the requested resource count to 1, and set the vLLM launch argument tensor_parallel_size to 1.

resources:
  limits:
    cpu: "8"
    huawei.com/Ascend310P: "1"
    memory: 64Gi
  requests:
    cpu: "4"
    huawei.com/Ascend310P: "1"
    memory: 32Gi
...
# start vllm service
vllm serve Qwen/Qwen3-8B \
  --served-model-name Qwen/Qwen3-8B \
  --trust-remote-code \
  --enable-prefix-caching \
  --port 8000 \
  --tensor-parallel-size 1 \
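The two values have to agree: tensor_parallel_size cannot exceed the number of NPUs the pod requests, or the workers fail to come up. A pre-deploy sanity check, assuming the resource name used above:

```python
def check_tp(resources: dict, tp_size: int,
             device_key: str = "huawei.com/Ascend310P") -> None:
    """Raise if tensor_parallel_size exceeds the requested NPU count."""
    requested = int(resources.get("limits", {}).get(device_key, 0))
    if tp_size > requested:
        raise ValueError(
            f"tensor_parallel_size={tp_size} but only {requested} "
            f"{device_key} device(s) requested")

# One card requested, TP=1: passes silently.
check_tp({"limits": {"huawei.com/Ascend310P": "1"}}, tp_size=1)
```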

bf16 dtype error

(EngineCore pid=35) [PID: 35] 2026-04-09-08:27:33.217.113 AclNN_Parameter_Error(EZ1001): Tensor self not implemented for DT_BFLOAT16, should be in dtype support list [DT_FLOAT,DT_FLOAT16,DT_INT8,DT_INT16,DT_INT32,DT_INT64,DT_UINT8,DT_BOOL,DT_DOUBLE,].
(EngineCore pid=35) 
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/python3.11.14/bin/vllm", line 6, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 92, in run
(APIServer pid=1)     return runner.run(wrapper())
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 656, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 670, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 103, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 144, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(APIServer pid=1) [ERROR] 2026-04-09-08:27:49 (PID:1, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
(APIServer pid=1) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
According to an AI-generated analysis (not independently verified): the Ascend 310P does not support the bfloat16 (bf16) dtype, but vLLM uses torch.ones(..., dtype=torch.bfloat16) when initializing the rotary embedding (RoPE), which makes the ACL operator fail.
Among floating-point types, the 310P operator library supports only float32 and float16; bf16 is not included.

Work around it by forcing float16 instead of bfloat16 via --dtype half:

vllm serve Qwen/Qwen3-8B \
  --dtype half \
  ...other arguments
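A dtype guard along these lines can catch the mismatch before launch. The supported-dtype set mirrors the float types listed in the ACL error above; treating "auto" as a bf16 request is an assumption (vLLM's "auto" picks the model's native dtype, which for this model is bf16):

```python
# Float dtypes the 310P operator library accepts, per the ACL error message;
# "half"/"float" are vLLM's accepted spellings for float16/float32.
SUPPORTED_310P_DTYPES = {"float32", "float", "float16", "half"}

def resolve_dtype(requested: str) -> str:
    """Map a bf16 (or model-default) dtype request to float16 on 310P."""
    if requested in ("bfloat16", "auto"):
        print(f"dtype {requested!r} not supported on 310P; falling back to 'half'")
        return "half"
    if requested not in SUPPORTED_310P_DTYPES:
        raise ValueError(f"unsupported dtype on 310P: {requested}")
    return requested
```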

npu_dynamic_quant operator error

With the fixes above, the pod runs longer before failing:

(EngineCore pid=35) INFO 04-09 08:43:11 [weight_utils.py:574] Time spent downloading weights for Qwen/Qwen3-8B: 1.077223 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:08<00:32,  8.07s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:16<00:24,  8.33s/it]

but it still ends in an error; the traceback is as follows:

(EngineCore pid=35) INFO 04-09 08:47:55 [default_loader.py:384] Loading weights took 35.12 seconds
(EngineCore pid=35) INFO 04-09 08:47:57 [model_runner_v1.py:2589] Loading model weights took 17.6043 GB
.(EngineCore pid=35) INFO 04-09 08:48:11 [backends.py:988] Using cache directory: /root/.cache/vllm/torch_compile_cache/4de24ceb58/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=35) INFO 04-09 08:48:11 [backends.py:1048] Dynamo bytecode transform time: 13.07 s
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] EngineCore failed to start.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] Traceback (most recent call last):
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     super().__init__(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 245, in _initialize_kv_caches
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/worker_310p.py", line 69, in determine_available_memory
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     self.model_runner.profile_run()
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2550, in profile_run
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     super().profile_run()
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 5516, in profile_run
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/model_runner_310p.py", line 170, in _dummy_run
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return super()._dummy_run(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2492, in _dummy_run
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     outputs = self._model_forward(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]               ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1818, in _model_forward
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     hidden_states = self.model(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                     ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen3.py", line 322, in forward
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     hidden_states = self.model(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                     ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 597, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     output = TorchCompileWithNoGuardsWrapper.__call__(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 182, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._call_with_optional_nvtx_range(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return callable_fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 845, in compile_wrapper
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 2196, in _call_user_compiler
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     raise BackendCompilerFailed(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 2171, in _call_user_compiler
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     compiled_fn = compiler_fn(gm, example_inputs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py", line 156, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     compiled_gm = compiler_fn(gm, example_inputs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/__init__.py", line 2437, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self.compiler_fn(model_, inputs_, **self.kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 81, in inner
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwds)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/backends.py", line 1063, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     self.configure_post_pass()
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/backends.py", line 847, in configure_post_pass
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     self.pass_manager.configure(self.vllm_config)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/graph_fusion_pass_manager.py", line 55, in configure
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     self.passes.append(AddRMSNormQuantFusionPass(config))
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/norm_quant_fusion_pass.py", line 493, in __init__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     AddRMSNormDynamicQuantPattern(vllm_config, eps=eps).register(self.pattern_match_passes)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/base_pattern.py", line 49, in register
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     pm.register_replacement(pattern_fn, replacement_fn, example_inputs, pm.fwd_only, pm_pass)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1552, in register_replacement
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     pattern, gm = gen_pattern_and_search_gm(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 81, in inner
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwds)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1760, in gen_pattern_and_search_gm
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     search_gm = trace_fn(search_fn, flat_inputs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 2115, in fwd_only
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     gm = make_fx(fn, decompositions, tracing_mode="real")(*args)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2429, in wrapped
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return make_fx_tracer.trace(f, *args)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2356, in trace
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._trace_inner(f, *args)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2318, in _trace_inner
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     t = dispatch_trace(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         ^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_compile.py", line 53, in inner
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return disable_fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1303, in dispatch_trace
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     graph = tracer.trace(root, concrete_args)  # type: ignore[arg-type]
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/_symbolic_trace.py", line 868, in trace
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     (self.create_arg(fn(*args)),),
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                      ^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1361, in wrapped
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     out = f(*tensors)  # type:ignore[call-arg]
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]           ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/norm_quant_fusion_pass.py", line 300, in pattern
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     quantized_output = torch.ops.npu.npu_dynamic_quant(out0)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1409, in __torch_function__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_stats.py", line 28, in wrapper
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1534, in __torch_dispatch__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return proxy_call(self, func, self.pre_dispatch, args, kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 994, in proxy_call
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     out = func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]           ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 841, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] torch._dynamo.exc.BackendCompilerFailed: backend='<vllm.compilation.backends.VllmBackend object at 0xfffec83c17d0>' raised:
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] RuntimeError: npu_dynamic_quant:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:82 NPU function error: call aclnnDynamicQuantV2 failed, error code is 561103
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] [ERROR] 2026-04-09-08:48:12 (PID:35, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] [PID: 35] 2026-04-09-08:48:12.623.243 AclNN_Parameter_Error(EZ1001): DynamicQuant launch kernel failed.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         TraceBack (most recent call last):
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         Tiling failed
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         Tiling Failed.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         Kernel GetWorkspace failed. opType: 21
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         DynamicQuant launch kernel failed.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] 
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] 
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] 
(EngineCore pid=35) Process EngineCore:
(EngineCore pid=35) Traceback (most recent call last):
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=35)     self.run()
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore pid=35)     self._target(*self._args, **self._kwargs)
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=35)     raise e
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=35)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=35)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=35)     super().__init__(
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=35)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=35)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 245, in _initialize_kv_caches
(EngineCore pid=35)     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=35)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=35)     return self.collective_rpc("determine_available_memory")
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=35)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=35)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/worker_310p.py", line 69, in determine_available_memory
(EngineCore pid=35)     self.model_runner.profile_run()
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2550, in profile_run
(EngineCore pid=35)     super().profile_run()
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 5516, in profile_run
(EngineCore pid=35)     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=35)                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/model_runner_310p.py", line 170, in _dummy_run
(EngineCore pid=35)     return super()._dummy_run(
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2492, in _dummy_run
(EngineCore pid=35)     outputs = self._model_forward(
(EngineCore pid=35)               ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1818, in _model_forward
(EngineCore pid=35)     hidden_states = self.model(
(EngineCore pid=35)                     ^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35)     return self._call_impl(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35)     return forward_call(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen3.py", line 322, in forward
(EngineCore pid=35)     hidden_states = self.model(
(EngineCore pid=35)                     ^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 597, in __call__
(EngineCore pid=35)     output = TorchCompileWithNoGuardsWrapper.__call__(
(EngineCore pid=35)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 182, in __call__
(EngineCore pid=35)     return self._call_with_optional_nvtx_range(
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range
(EngineCore pid=35)     return callable_fn(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 845, in compile_wrapper
(EngineCore pid=35)     raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
(EngineCore pid=35)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 2196, in _call_user_compiler
(EngineCore pid=35)     raise BackendCompilerFailed(
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 2171, in _call_user_compiler
(EngineCore pid=35)     compiled_fn = compiler_fn(gm, example_inputs)
(EngineCore pid=35)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py", line 156, in __call__
(EngineCore pid=35)     compiled_gm = compiler_fn(gm, example_inputs)
(EngineCore pid=35)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/__init__.py", line 2437, in __call__
(EngineCore pid=35)     return self.compiler_fn(model_, inputs_, **self.kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 81, in inner
(EngineCore pid=35)     return func(*args, **kwds)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/backends.py", line 1063, in __call__
(EngineCore pid=35)     self.configure_post_pass()
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/backends.py", line 847, in configure_post_pass
(EngineCore pid=35)     self.pass_manager.configure(self.vllm_config)
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/graph_fusion_pass_manager.py", line 55, in configure
(EngineCore pid=35)     self.passes.append(AddRMSNormQuantFusionPass(config))
(EngineCore pid=35)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/norm_quant_fusion_pass.py", line 493, in __init__
(EngineCore pid=35)     AddRMSNormDynamicQuantPattern(vllm_config, eps=eps).register(self.pattern_match_passes)
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/base_pattern.py", line 49, in register
(EngineCore pid=35)     pm.register_replacement(pattern_fn, replacement_fn, example_inputs, pm.fwd_only, pm_pass)
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1552, in register_replacement
(EngineCore pid=35)     pattern, gm = gen_pattern_and_search_gm(
(EngineCore pid=35)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 81, in inner
(EngineCore pid=35)     return func(*args, **kwds)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1760, in gen_pattern_and_search_gm
(EngineCore pid=35)     search_gm = trace_fn(search_fn, flat_inputs)
(EngineCore pid=35)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 2115, in fwd_only
(EngineCore pid=35)     gm = make_fx(fn, decompositions, tracing_mode="real")(*args)
(EngineCore pid=35)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2429, in wrapped
(EngineCore pid=35)     return make_fx_tracer.trace(f, *args)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2356, in trace
(EngineCore pid=35)     return self._trace_inner(f, *args)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2318, in _trace_inner
(EngineCore pid=35)     t = dispatch_trace(
(EngineCore pid=35)         ^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_compile.py", line 53, in inner
(EngineCore pid=35)     return disable_fn(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=35)     return fn(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1303, in dispatch_trace
(EngineCore pid=35)     graph = tracer.trace(root, concrete_args)  # type: ignore[arg-type]
(EngineCore pid=35)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=35)     return fn(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/_symbolic_trace.py", line 868, in trace
(EngineCore pid=35)     (self.create_arg(fn(*args)),),
(EngineCore pid=35)                      ^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1361, in wrapped
(EngineCore pid=35)     out = f(*tensors)  # type:ignore[call-arg]
(EngineCore pid=35)           ^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/norm_quant_fusion_pass.py", line 300, in pattern
(EngineCore pid=35)     quantized_output = torch.ops.npu.npu_dynamic_quant(out0)
(EngineCore pid=35)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35)     return self._op(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1409, in __torch_function__
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35)     return self._op(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_stats.py", line 28, in wrapper
(EngineCore pid=35)     return fn(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1534, in __torch_dispatch__
(EngineCore pid=35)     return proxy_call(self, func, self.pre_dispatch, args, kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 994, in proxy_call
(EngineCore pid=35)     out = func(*args, **kwargs)
(EngineCore pid=35)           ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 841, in __call__
(EngineCore pid=35)     return self._op(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) torch._dynamo.exc.BackendCompilerFailed: backend='<vllm.compilation.backends.VllmBackend object at 0xfffec83c17d0>' raised:
(EngineCore pid=35) RuntimeError: npu_dynamic_quant:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:82 NPU function error: call aclnnDynamicQuantV2 failed, error code is 561103
(EngineCore pid=35) [ERROR] 2026-04-09-08:48:12 (PID:35, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(EngineCore pid=35) [PID: 35] 2026-04-09-08:48:12.623.243 AclNN_Parameter_Error(EZ1001): DynamicQuant launch kernel failed.
(EngineCore pid=35)         TraceBack (most recent call last):
(EngineCore pid=35)         Tiling failed
(EngineCore pid=35)         Tiling Failed.
(EngineCore pid=35)         Kernel GetWorkspace failed. opType: 21
(EngineCore pid=35)         DynamicQuant launch kernel failed.
(EngineCore pid=35) 
(EngineCore pid=35) 
(EngineCore pid=35) Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(EngineCore pid=35) 
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/python3.11.14/bin/vllm", line 6, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 92, in run
(APIServer pid=1)     return runner.run(wrapper())
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 656, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 670, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 103, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 144, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(APIServer pid=1) [ERROR] 2026-04-09-08:48:30 (PID:1, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
(APIServer pid=1) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
According to the AI's analysis, the root cause is that the norm_quant fusion pass emits an npu_dynamic_quant operator for the 310P at compile time, but the 310P does not support this dynamic-quantization operator (or the current CANN version is incompatible with it), so tiling fails.

The openFuyao community responded as follows:

@tl.s
Judging from the error log, the 310 card does not support the DynamicQuantV2 operator that vllm-ascend enables by default. Try adding the startup options --enforce-eager and --no-quant.
In InferNex, vLLM startup parameters that are not exposed directly can be added under inference-backend.services[0].pd.prefill/decode.extraArgs. For example:
extraArgs:
    - "--enforce-eager"
    - "--no-quant"
If that still fails, try a different model and deploy one of the examples from the official 310P documentation, such as Qwen2.5-7B-Instruct:
https://docs.vllm.ai/projects/ascend/en/latest/tutorials/hardwares/310p.html#online-inference-on-npu
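The reply abbreviates the field path as `prefill/decode`. Assuming that shorthand maps onto the InferNex Helm values as shown below (the exact schema may differ by version, so treat this as a sketch rather than the canonical layout), the full placement would look like:

```yaml
inference-backend:
  services:
    - pd:
        prefill:
          extraArgs:
            - "--enforce-eager"
            - "--no-quant"
        decode:
          extraArgs:
            - "--enforce-eager"
            - "--no-quant"
```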

OOM

Following that advice, I redeployed with Qwen2.5-7B-Instruct; the pod then failed again, this time with an NPU out-of-memory error during startup profiling:
[root@master1 fuyao-26.3-rc3]# kubectl  -n ai-inference logs deployments/vllm-pd-2p1d-01 -f 
Defaulted container "aggregated-engine" out of: aggregated-engine, huggingface-download (init)



INFO 04-13 06:21:29 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 04-13 06:21:29 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 04-13 06:21:29 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-13 06:21:29 [__init__.py:239] Platform plugin ascend is activated
INFO 04-13 06:21:42 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.netloader.netloader.ModelNetLoaderElastic'>` with load format `netloader`
INFO 04-13 06:21:42 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.rfork.rfork_loader.RForkModelLoader'>` with load format `rfork`
WARNING 04-13 06:21:44 [__init__.py:80] The quantization method 'ascend' already exists and will be overwritten by the quantization config <class 'vllm_ascend._310p.quantization.modelslim_config.AscendModelSlimConfig310'>.
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297] 
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.18.0
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297]   █▄█▀ █     █     █     █  model   Qwen/Qwen2.5-7B-Instruct
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297] 
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:233] non-default args: {'model_tag': 'Qwen/Qwen2.5-7B-Instruct', 'model': 'Qwen/Qwen2.5-7B-Instruct', 'trust_remote_code': True, 'dtype': 'float16', 'max_model_len': 4096, 'enforce_eager': True, 'served_model_name': ['Qwen/Qwen2.5-7B-Instruct'], 'block_size': 128, 'gpu_memory_utilization': 0.8, 'enable_prefix_caching': True, 'max_num_batched_tokens': 40960, 'kv_events_config': KVEventsConfig(enable_kv_cache_events=True, publisher='zmq', endpoint='tcp://*:5557', replay_endpoint=None, buffer_steps=10000, hwm=100000, max_queue_size=100000, topic='kv-events')}
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT_8000_TCP_ADDR
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_SERVICE_PORT_HTTP_API
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT_8000_TCP_PORT
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT_8000_TCP
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_SERVICE_PORT
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT_8000_TCP_PROTO
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_SERVICE_HOST
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_USE_V1
(APIServer pid=1) INFO 04-13 06:22:18 [model.py:533] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=1) WARNING 04-13 06:22:18 [model.py:1920] Casting torch.bfloat16 to torch.float16.
(APIServer pid=1) INFO 04-13 06:22:18 [model.py:1582] Using max model len 4096
(APIServer pid=1) INFO 04-13 06:22:18 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=40960.
(APIServer pid=1) INFO 04-13 06:22:18 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=1) WARNING 04-13 06:22:18 [vllm.py:788] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=1) WARNING 04-13 06:22:18 [vllm.py:799] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=1) INFO 04-13 06:22:18 [vllm.py:964] Cudagraph is disabled under eager mode
(APIServer pid=1) WARNING 04-13 06:22:26 [platform.py:749] Parameter '--disable-cascade-attn' is a GPU-specific feature. Resetting to False for Ascend.
(APIServer pid=1) WARNING 04-13 06:22:26 [platform.py:838] Ignored parameter 'disable_flashinfer_prefill'. This is a GPU-specific feature not supported on Ascend. Resetting to False.
(APIServer pid=1) INFO 04-13 06:22:26 [ascend_config.py:425] Dynamic EPLB is False
(APIServer pid=1) INFO 04-13 06:22:26 [ascend_config.py:426] The number of redundant experts is 0
(APIServer pid=1) INFO 04-13 06:22:26 [platform.py:297] Compilation disabled, using eager mode by default
(APIServer pid=1) INFO 04-13 06:22:26 [platform.py:502] Set PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
(APIServer pid=1) INFO 04-13 06:22:26 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
INFO 04-13 06:22:53 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 04-13 06:22:53 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 04-13 06:22:53 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-13 06:22:53 [__init__.py:239] Platform plugin ascend is activated
(EngineCore pid=35) INFO 04-13 06:23:03 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.netloader.netloader.ModelNetLoaderElastic'>` with load format `netloader`
(EngineCore pid=35) INFO 04-13 06:23:03 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.rfork.rfork_loader.RForkModelLoader'>` with load format `rfork`
(EngineCore pid=35) INFO 04-13 06:23:03 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=npu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-7B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'vllm_ascend.compilation.compiler_interface.AscendCompiler', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [40960], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 
'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=35) WARNING 04-13 06:23:08 [camem.py:66] Failed to import vllm_ascend_C:/vllm-workspace/vllm-ascend/vllm_ascend/vllm_ascend_C.cpython-311-aarch64-linux-gnu.so: undefined symbol: _ZN9pp_matmul17GetPpMatmulTilingERKNS_10MatMulInfoERKNS_12HardwareInfoERjRNS_18PpMatmulTilingDataE. Sleep mode will be disabled. 
(EngineCore pid=35) INFO 04-13 06:23:08 [ascend_config.py:425] Dynamic EPLB is False
(EngineCore pid=35) INFO 04-13 06:23:08 [ascend_config.py:426] The number of redundant experts is 0
INFO 04-13 06:23:22 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 04-13 06:23:22 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 04-13 06:23:22 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-13 06:23:22 [__init__.py:239] Platform plugin ascend is activated
....(EngineCore pid=35) INFO 04-13 06:24:35 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.137.164:42089 backend=hccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore pid=35) INFO 04-13 06:24:36 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore pid=35) WARNING 04-13 06:24:36 [worker.py:306] Bind cpus failed in rank0: Can not get running npu info. Skip binding cpu.
(EngineCore pid=35) INFO 04-13 06:24:37 [model_runner_v1.py:2562] Starting to load model Qwen/Qwen2.5-7B-Instruct...
(EngineCore pid=35) INFO 04-13 06:24:56 [weight_utils.py:574] Time spent downloading weights for Qwen/Qwen2.5-7B-Instruct: 4.228922 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:10<00:30, 10.12s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:20<00:20, 10.32s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:30<00:10, 10.21s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:40<00:00,  9.99s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:40<00:00, 10.08s/it]
(EngineCore pid=35) 
(EngineCore pid=35) INFO 04-13 06:25:46 [default_loader.py:384] Loading weights took 40.53 seconds
(EngineCore pid=35) INFO 04-13 06:25:48 [model_runner_v1.py:2589] Loading model weights took 16.2391 GB
.(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] EngineCore failed to start.
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] Traceback (most recent call last):
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     super().__init__(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 245, in _initialize_kv_caches
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/worker_310p.py", line 69, in determine_available_memory
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     self.model_runner.profile_run()
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2550, in profile_run
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     super().profile_run()
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 5516, in profile_run
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/model_runner_310p.py", line 170, in _dummy_run
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return super()._dummy_run(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2492, in _dummy_run
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     outputs = self._model_forward(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]               ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1818, in _model_forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     hidden_states = self.model(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                     ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 583, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     hidden_states = self.model(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                     ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 439, in __call__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self.forward(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 444, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     hidden_states, residual = layer(positions, hidden_states, residual)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 311, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     hidden_states = self.mlp(hidden_states)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                     ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 114, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     gate_up, _ = self.gate_up_proj(x)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                  ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/ops/linear.py", line 215, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return super().forward(input_)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 582, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     output_parallel = self.quant_method.apply(self, input_, bias)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 228, in apply
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return dispatch_unquantized_gemm()(layer, x, layer.weight, bias)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/worker/patch_unquantized_gemm.py", line 55, in default_unquantized_gemm
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return torch.ops.vllm.unquantized_gemm(x, weight, bias)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/parameter.py", line 126, in __torch_function__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return super().__torch_function__(func, types, args, kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/worker/patch_unquantized_gemm.py", line 27, in unquantized_gemm
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return torch.nn.functional.linear(x, weight, bias)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] RuntimeError: NPU out of memory. Tried to allocate 2.89 GiB (NPU 0; 21.02 GiB total capacity; 17.36 GiB already allocated; 17.36 GiB current active; 2.15 GiB free; 17.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
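The numbers in that OOM message are internally consistent. A rough sanity check, with all values copied from the log above (GB vs GiB rounding ignored):

```python
total_gib = 21.02       # "21.02 GiB total capacity" (NPU 0)
weights_gb = 16.2391    # "Loading model weights took 16.2391 GB"
failed_alloc = 2.89     # "Tried to allocate 2.89 GiB" during profile_run
free_gib = 2.15         # "2.15 GiB free" at the time of the failure

# The fp16 weights of a 7B-class model nearly fill the 21 GiB card by themselves.
headroom = total_gib - weights_gb
print(f"headroom after weights: {headroom:.2f} GiB")  # about 4.78 GiB

# The profile run simulates max_num_batched_tokens=40960 tokens in one batch;
# its activations pushed allocation to 17.36 GiB before requesting 2.89 GiB
# more than the 2.15 GiB still free:
print(failed_alloc > free_gib)  # True -> NPU out of memory
```

Since the weights alone occupy ~16.2 GiB of the 21 GiB device, lowering gpu_memory_utilization cannot help here; the usual ways out are reducing max_num_batched_tokens (40960 in this config), sharding the model with tensor parallelism across chips, or using a smaller or quantized model.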
(EngineCore pid=35) Process EngineCore:
(EngineCore pid=35) Traceback (most recent call last):
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=35)     self.run()
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore pid=35)     self._target(*self._args, **self._kwargs)
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=35)     raise e
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=35)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=35)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=35)     super().__init__(
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=35)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=35)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 245, in _initialize_kv_caches
(EngineCore pid=35)     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=35)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=35)     return self.collective_rpc("determine_available_memory")
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=35)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=35)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/worker_310p.py", line 69, in determine_available_memory
(EngineCore pid=35)     self.model_runner.profile_run()
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2550, in profile_run
(EngineCore pid=35)     super().profile_run()
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 5516, in profile_run
(EngineCore pid=35)     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=35)                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/model_runner_310p.py", line 170, in _dummy_run
(EngineCore pid=35)     return super()._dummy_run(
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2492, in _dummy_run
(EngineCore pid=35)     outputs = self._model_forward(
(EngineCore pid=35)               ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1818, in _model_forward
(EngineCore pid=35)     hidden_states = self.model(
(EngineCore pid=35)                     ^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35)     return self._call_impl(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35)     return forward_call(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 583, in forward
(EngineCore pid=35)     hidden_states = self.model(
(EngineCore pid=35)                     ^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 439, in __call__
(EngineCore pid=35)     return self.forward(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 444, in forward
(EngineCore pid=35)     hidden_states, residual = layer(positions, hidden_states, residual)
(EngineCore pid=35)                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35)     return self._call_impl(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35)     return forward_call(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 311, in forward
(EngineCore pid=35)     hidden_states = self.mlp(hidden_states)
(EngineCore pid=35)                     ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35)     return self._call_impl(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35)     return forward_call(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 114, in forward
(EngineCore pid=35)     gate_up, _ = self.gate_up_proj(x)
(EngineCore pid=35)                  ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35)     return self._call_impl(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35)     return forward_call(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/ops/linear.py", line 215, in forward
(EngineCore pid=35)     return super().forward(input_)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 582, in forward
(EngineCore pid=35)     output_parallel = self.quant_method.apply(self, input_, bias)
(EngineCore pid=35)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 228, in apply
(EngineCore pid=35)     return dispatch_unquantized_gemm()(layer, x, layer.weight, bias)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/worker/patch_unquantized_gemm.py", line 55, in default_unquantized_gemm
(EngineCore pid=35)     return torch.ops.vllm.unquantized_gemm(x, weight, bias)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35)     return self._op(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/parameter.py", line 126, in __torch_function__
(EngineCore pid=35)     return super().__torch_function__(func, types, args, kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35)     return self._op(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/worker/patch_unquantized_gemm.py", line 27, in unquantized_gemm
(EngineCore pid=35)     return torch.nn.functional.linear(x, weight, bias)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) RuntimeError: NPU out of memory. Tried to allocate 2.89 GiB (NPU 0; 21.02 GiB total capacity; 17.36 GiB already allocated; 17.36 GiB current active; 2.15 GiB free; 17.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/python3.11.14/bin/vllm", line 6, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 92, in run
(APIServer pid=1)     return runner.run(wrapper())
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 656, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 670, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 103, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 144, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(APIServer pid=1) [ERROR] 2026-04-13-06:26:05 (PID:1, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
(APIServer pid=1) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

According to the official docs, a single 310P card can apparently only run a 0.6B model:

Run the following script to start the vLLM server on NPU (Qwen3-0.6B:1 card, Qwen2.5-7B-Instruct:2 cards, Pangu-Pro-MoE-72B: 8 cards)

https://docs.vllm.ai/projects/ascend/en/latest/tutorials/hardwares/310p.html#online-inference-on-npu

And a few launch parameters are required:

vllm serve Qwen/Qwen3-0.6B \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --enforce-eager \
    --dtype float16
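The OOM above is consistent with simple arithmetic. A back-of-envelope sketch (my own estimate, not from the docs): fp16 weights cost about 2 bytes per parameter, before any KV cache or activations:

```shell
# Rough fp16 weight footprint: ~2 bytes per parameter.
# Model parameter counts are nominal; KV cache and activations come on top.
for model in "Qwen3-0.6B:0.6" "Qwen2.5-7B:7"; do
  name=${model%%:*}; params_b=${model##*:}
  awk -v n="$name" -v p="$params_b" \
    'BEGIN{printf "%s: ~%.1f GB fp16 weights\n", n, p*2}'
done
```

Roughly 1.2 GB for 0.6B versus 14 GB for 7B: against the 21.02 GiB total capacity the 310P reports above, 7B weights alone leave almost no headroom, while 0.6B fits comfortably on one card.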

Helm upgrade errors

Every helm upgrade takes several rounds like the following before it succeeds:

[root@master1 fuyao-26.3-rc3]# helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values 
Error: UPGRADE FAILED: cannot patch "vllm-pd-2p1d-01" with kind Deployment: Deployment.apps "vllm-pd-2p1d-01" is invalid: spec.selector: Invalid value: {"matchLabels":{"app.kubernetes.io/instance":"infernex-vllm-pd-2p1d-01","app.kubernetes.io/name":"inference-backend","openfuyao.com/dpSize":"1","openfuyao.com/engine":"vllm","openfuyao.com/model":"qwen-qwen3-0.6b","openfuyao.com/pdRole":"aggregate","openfuyao.com/tpSize":"2"}}: field is immutable
[root@master1 fuyao-26.3-rc3]# kubectl -n ai-inference delete deployments.apps vllm-pd-2p1d-01 
deployment.apps "vllm-pd-2p1d-01" deleted from ai-inference namespace
[root@master1 fuyao-26.3-rc3]# helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values 
Error: UPGRADE FAILED: post-upgrade hooks failed: warning: Hook post-upgrade infernex/charts/pd-orchestrator/charts/resourcescalinggroup/templates/webhook-wait-hook.yaml failed: 1 error occurred:
        * jobs.batch "infernex-resourcescalinggroup-wait-webhook" is forbidden: unable to create new content in namespace scaling-system because it is being terminated


[root@master1 fuyao-26.3-rc3]# helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values 
Error: UPGRADE FAILED: failed to create resource: namespaces "scaling-system" not found
[root@master1 fuyao-26.3-rc3]# helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values 


Release "infernex" has been upgraded. Happy Helming!
NAME: infernex
LAST DEPLOYED: Mon Apr 13 14:33:47 2026
NAMESPACE: ai-inference
STATUS: deployed
REVISION: 27
TEST SUITE: None
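The rounds above can be collapsed into one ordered sequence. A sketch under the assumption that the same three failures recur (resource names taken from the log; RUN=echo makes it a dry run):

```shell
# Dry-run by default; unset RUN to actually execute.
RUN=${RUN:-echo}
# 1. The Deployment selector is immutable, so the old Deployment must go first.
$RUN kubectl -n ai-inference delete deployment vllm-pd-2p1d-01 --ignore-not-found
# 2. The post-upgrade hook Job is created in scaling-system; wait for the old
#    namespace to finish terminating (succeeds immediately if already gone).
$RUN kubectl wait --for=delete namespace/scaling-system --timeout=180s
# 3. Only now should the upgrade go through in one shot.
$RUN helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values
```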

istio HTTPRoute error

[root@master1 fuyao-26.3-rc3]# kubectl  -n ai-inference describe httproutes.gateway.networking.k8s.io qwen-qwen3-0.6b-httproute 
Name:         qwen-qwen3-0.6b-httproute
Namespace:    ai-inference
Labels:       app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=infernex-epp
              app.kubernetes.io/version=0.21.0
Annotations:  meta.helm.sh/release-name: infernex
              meta.helm.sh/release-namespace: ai-inference
API Version:  gateway.networking.k8s.io/v1
Kind:         HTTPRoute
Metadata:
  Creation Timestamp:  2026-04-13T06:31:19Z
  Generation:          1
  Resource Version:    3664436
  UID:                 21dca6e8-483a-4b03-8a78-82559c45a7e3
Spec:
  Parent Refs:
    Group:  gateway.networking.k8s.io
    Kind:   Gateway
    Name:   inference-gateway
  Rules:
    Backend Refs:
      Group:   inference.networking.k8s.io
      Kind:    InferencePool
      Name:    qwen-qwen3-0.6b
      Weight:  1
    Matches:
      Path:
        Type:   PathPrefix
        Value:  /
    Timeouts:
      Request:  300s
Status:
  Parents:
    Conditions:
      Last Transition Time:  2026-04-13T06:31:19Z
      Message:               Route was valid
      Observed Generation:   1
      Reason:                Accepted
      Status:                True
      Type:                  Accepted
      Last Transition Time:  2026-04-13T06:31:19Z
      Message:               InferencePool.Name invalid; the name of the InferencePool must be used, not the hostname.
      Observed Generation:   1
      Reason:                InvalidDestination
      Status:                False
      Type:                  ResolvedRefs
    Controller Name:         istio.io/gateway-controller
    Parent Ref:
      Group:  gateway.networking.k8s.io
      Kind:   Gateway
      Name:   inference-gateway
Events:       <none>
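The ResolvedRefs condition says the backendRef must carry the InferencePool's name, not a hostname. A dry-run sketch of that cross-check (resource names taken from the output above; RUN=echo prints instead of executing):

```shell
RUN=${RUN:-echo}   # dry-run by default
# The backendRef name in the HTTPRoute must equal the metadata.name of an
# InferencePool in the same namespace.
$RUN kubectl -n ai-inference get inferencepools.inference.networking.k8s.io
$RUN kubectl -n ai-inference get httproute qwen-qwen3-0.6b-httproute \
  -o jsonpath='{.spec.rules[0].backendRefs[0].name}'
```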

Summary

  • The helm chart requests 2 × 310P cards by default and needs manual editing:

    resources:
      limits:
        cpu: "8"
        huawei.com/Ascend310P: "1"
        memory: 64Gi
      requests:
        cpu: "4"
        huawei.com/Ascend310P: "1"
        memory: 32Gi
  • Hugging Face downloads need a domestic mirror configured manually:

    - name: HF_ENDPOINT
      value: https://hf-mirror.com
  • The vllm launch parameters need manual tuning:

    vllm serve Qwen/Qwen3-0.6B \
        --served-model-name Qwen/Qwen3-0.6B \
        --trust-remote-code \
        --enable-prefix-caching \
        --port 8000 \
        --tensor-parallel-size 1 \
        --max-model-len 4096 \
        --enforce-eager \
        --dtype float16 \
        --max-num-batched-tokens 40960 \
        --data-parallel-size 1 \
        --gpu-memory-utilization 0.8 \
        --block-size 128 \
        --kv-events-config '{"enable_kv_cache_events": true, "publisher":"zmq", "topic":"kv-events"}'
  • By default the service uses ClusterAPI and is exposed through the istio Gateway; this is not working correctly yet
  • The vllm pod logs show frequent prints of:
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
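Until the Gateway path works, the backend can be smoke-tested directly against vLLM's OpenAI-compatible API (SVC is a placeholder for the pod or Service address; RUN=echo makes it a dry run):

```shell
RUN=${RUN:-echo}           # dry-run by default; unset RUN to execute
SVC=${SVC:-127.0.0.1:8000} # placeholder address, not from the log
$RUN curl -s "http://$SVC/v1/completions" \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen/Qwen3-0.6B", "prompt": "hello", "max_tokens": 8}'
```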

Refs

Solving the ceph mon "Operation not permitted" problem

By tl.s
April 3, 2026, 11:25

On a self-built ceph, the mon would not start, with the following error:

Apr 03 11:14:30 debian systemd[1]: Started Ceph cluster monitor daemon.
░░ Subject: A start job for unit ceph-mon@debian.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A start job for unit ceph-mon@debian.service has finished successfully.
░░ 
░░ The job identifier is 6997.
Apr 03 11:14:31 debian ceph-mon[374601]: 2026-04-03T11:14:31.084+0800 ffffaf907040 -1 load: jerasure load: lrc load dlopen(/usr/lib/ceph/erasure-code/libec_isa.so): /usr/lib/ceph/erasure-code/libec_isa.so: cannot make segment writable for relocation: Operation not permitted
Apr 03 11:14:31 debian systemd[1]: ceph-mon@debian.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ An ExecStart= process belonging to unit ceph-mon@debian.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.
Apr 03 11:14:31 debian systemd[1]: ceph-mon@debian.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ The unit ceph-mon@debian.service has entered the 'failed' state with result 'exit-code'.

Temporary workaround

According to claude-sonnet 4.6:

The quickest way to get the service running right now is a systemd override that turns off that memory protection:

mkdir -p /etc/systemd/system/ceph-mon@.service.d/

cat > /etc/systemd/system/ceph-mon@.service.d/override.conf << 'EOF'
[Service]
MemoryDenyWriteExecute=false
EOF

systemctl daemon-reload
systemctl restart ceph-mon@$(hostname)
systemctl status ceph-mon@$(hostname)

The same problem also affects ceph-osd and ceph-mds, which can be handled in one pass:

for svc in ceph-mon ceph-osd ceph-mds; do
  mkdir -p /etc/systemd/system/${svc}@.service.d/
  cat > /etc/systemd/system/${svc}@.service.d/override.conf << 'EOF'
[Service]
MemoryDenyWriteExecute=false
EOF
done

systemctl daemon-reload
systemctl restart ceph-mon@$(hostname)

This approach requires no recompilation and directly resolves the Operation not permitted error.

MemoryDenyWriteExecute=false only disables a per-process memory-hardening option whose job is to stop a process from marking the same memory page as both writable and executable.
The only cost is a theoretical, slight drop in hardening: if the ceph-mon process were compromised through a vulnerability, an attacker would find it marginally easier to inject executable code. For a storage cluster on an internal network, this risk is all but negligible.

This is a workable stopgap, but the risk has not been fully assessed; verify thoroughly before relying on it.
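Whether the override actually took effect can be read back from systemd's merged view of the unit (a verification step of my own, not from the original fix; RUN=echo keeps it a dry run):

```shell
RUN=${RUN:-echo}   # dry-run by default; unset RUN to execute on the host
# systemctl show prints the effective value after all override files are merged.
$RUN systemctl show "ceph-mon@$(hostname)" -p MemoryDenyWriteExecute
# With the override in place this should report the option as disabled.
```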

Permanent fix

A permanent fix is outlined below:

The only correct root-cause fix: disable the ISA plugin at build time.
In debian/rules, find the cmake arguments and add -DWITH_EC_ISA_PLUGIN=OFF:

grep -n "cmake\|CMAKE" ~/build-ceph/ceph/ceph-16.2.7/debian/rules | head -20

Find the line with the cmake invocation, add the parameter, and rebuild.
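For background, the loader message "cannot make segment writable for relocation" indicates that libec_isa.so contains text relocations (TEXTREL), which force the loader to make executable pages writable at load time; that is precisely what MemoryDenyWriteExecute forbids. A hedged way to check a (re)built library (path from the error message above; RUN=echo keeps it a dry run):

```shell
RUN=${RUN:-echo}   # dry-run by default; unset RUN to execute on the host
# A cleanly built shared library has no TEXTREL entry in its dynamic section;
# if one is present, the loader must rewrite the text segment and the
# MemoryDenyWriteExecute hardening will block it.
$RUN readelf -d /usr/lib/ceph/erasure-code/libec_isa.so
```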

The permanent fix has not been verified yet.

Ascend 310P + openFuyao + NPU-Operator Troubleshooting

By tl.s
April 1, 2026, 22:26

[TOC]

Faulty pod: describe

[root@master1 ~]# kubectl -n kube-system describe pod ascend-device-plugin-ll46f 
Name:                 ascend-device-plugin-ll46f
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      ascend-device-plugin-sa
Node:                 master1/10.17.30.131
Start Time:           Mon, 30 Mar 2026 11:08:32 +0800
Labels:               app.kubernetes.io/managed-by=npu-operator
                      controller-revision-hash=7df5dcb887
                      helm.sh/chart=npu-operator-0.15.0
                      name=ascend-device-plugin-ds
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: c1f2adcaeaaf2bdcf0a6e09730f68231a293074e31d58f61997f714dfb520878
                      cni.projectcalico.org/podIP: 192.168.137.118/32
                      cni.projectcalico.org/podIPs: 192.168.137.118/32
                      scheduler.alpha.kubernetes.io/critical-pod: 
                      seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:               Running
IP:                   192.168.137.118
IPs:
  IP:           192.168.137.118
Controlled By:  DaemonSet/ascend-device-plugin
Init Containers:
  init-permission:
    Container ID:  containerd://4406968a522bea48dfefebae81ec53644312762af4781c25de689952ed6c2d27
    Image:         cr.openfuyao.cn/openfuyao/busybox:1.36.1
    Image ID:      cr.openfuyao.cn/openfuyao/busybox@sha256:4b8407fadd8100c61b097d63efe992b2c033e7d371c9117f7a9462fe87e31176
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      chown 9000:9000 /var/log/mindx-dl /var/log/mindx-dl/devicePlugin
      chmod 750 /var/log/mindx-dl/devicePlugin
      
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 30 Mar 2026 15:28:32 +0800
      Finished:     Mon, 30 Mar 2026 15:28:32 +0800
    Ready:          True
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /var/log/mindx-dl/devicePlugin from log-path (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gfldg (ro)
Containers:
  device-plugin-01:
    Container ID:  containerd://fcc0c4742285847e2621a9a9217502307fc7e28644fbf86b32f9c11d67a2c0ab
    Image:         cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin:v6.0.0
    Image ID:      cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin@sha256:a5b9612b21bcd35384f9f19a05b2d7915b865e7b2be6a30bfd7806a9b8a86f58
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      --
    Args:
      device-plugin  -useAscendDocker=true -volcanoType=false -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log -logLevel=0
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 31 Mar 2026 10:28:58 +0800
      Finished:     Tue, 31 Mar 2026 10:28:58 +0800
    Ready:          False
    Restart Count:  274
    Limits:
      cpu:     500m
      memory:  500Mi
    Requests:
      cpu:     500m
      memory:  500Mi
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /tmp from tmp (rw)
      /usr/local/Ascend/driver from hiai-driver (ro)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/lib/kubelet/pod-resources from pod-resource (rw)
      /var/log/mindx-dl/devicePlugin from log-path (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gfldg (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  pod-resource:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/pod-resources
    HostPathType:  
  hiai-driver:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/Ascend/driver
    HostPathType:  
  log-path:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/mindx-dl/devicePlugin
    HostPathType:  DirectoryOrCreate
  tmp:
    Type:          HostPath (bare host directory volume)
    Path:          /tmp
    HostPathType:  
  kube-api-access-gfldg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              openfuyao.com/npu.present=
Tolerations:                 CriticalAddonsOnly op=Exists
                             device-plugin=v2:NoSchedule
                             huawei.com/Ascend910:NoSchedule op=Exists
                             node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulled   16m (x205 over 18h)     kubelet  (combined from similar events): Successfully pulled image "cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin:v6.0.0" in 403ms (403ms including waiting). Image size: 48017174 bytes.
  Warning  BackOff  2m47s (x5216 over 18h)  kubelet  Back-off restarting failed container device-plugin-01 in pod ascend-device-plugin-ll46f_kube-system(8edcd384-ab2d-4998-8077-5ac58801c79e)
  Normal   Pulling  66s (x227 over 19h)     kubelet  Pulling image "cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin:v6.0.0"

Faulty pod: /dev check

[root@master1 fuyao-26.3-rc3]# kubectl  -n kube-system exec -it daemonsets/ascend-device-plugin -- ls /dev
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
autofs           null               tty10  tty34  tty58    vcs5
bsg              ppp                tty11  tty35  tty59    vcs6
btrfs-control    ptmx               tty12  tty36  tty6     vcsa
bus              pts                tty13  tty37  tty60    vcsa1
core             random             tty14  tty38  tty61    vcsa2
cpu_dma_latency  raw                tty15  tty39  tty62    vcsa3
cuse             relationship_ctrl  tty16  tty4   tty63    vcsa4
davinci0         rfkill             tty17  tty40  tty7     vcsa5
davinci_manager  rtc0               tty18  tty41  tty8     vcsa6
devmm_svm        sda                tty19  tty42  tty9     vcsu
dri              sda1               tty2   tty43  ttyAMA0  vcsu1
fb0              sda2               tty20  tty44  ttyS0    vcsu2
fd               sg0                tty21  tty45  ttyS1    vcsu3
full             sg1                tty22  tty46  ttyS2    vcsu4
fuse             sg2                tty23  tty47  ttyS3    vcsu5
hidraw0          shm                tty24  tty48  uhid     vcsu6
hidraw1          snapshot           tty25  tty49  uinput   vfio
hisi_hdc         sr0                tty26  tty5   urandom  vga_arbiter
hwrng            sr1                tty27  tty50  usbmon0  vhost-net
input            stderr             tty28  tty51  usbmon1  vhost-vsock
kmsg             stdin              tty29  tty52  usbmon2  vport2p1
loop-control     stdout             tty3   tty53  vcs      zero
mapper           termination-log    tty30  tty54  vcs1
mem              tty                tty31  tty55  vcs2
mqueue           tty0               tty32  tty56  vcs3
net              tty1               tty33  tty57  vcs4

Faulty pod: driver mount check

[root@master1 fuyao-26.3-rc3]# kubectl  -n kube-system exec -it daemonsets/ascend-device-plugin -- ls -lha /usr/local/Ascend/driver
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
total 44K
drwxr-xr-x  8 root root 4.0K Mar 27 08:03 .
drwxr-xr-x  3 root root 4.0K Mar 31 02:34 ..
drwxr-xr-x  2 root root 4.0K Mar 27 08:01 bin
-r--r--r--  1 root root   20 Mar 27 08:01 build.info
dr-xr-x---  2 root root 4.0K Mar 27 08:01 device
dr-x------ 41 root root 4.0K Mar 27 08:01 kernel
drwxr-xr-x  6 root root 4.0K Mar 27 08:01 lib64
-r--r-----  1 root root   56 Mar 27 08:01 scene.info
dr-xr-x---  2 root root 4.0K Mar 27 08:01 script
drwxr-xr-x  2 root root 4.0K Mar 27 08:01 tools
-r--r--r--  1 root root  352 Mar 27 08:03 version.info

Faulty pod: logs

[root@master1 ~]# kubectl -n kube-system logs daemonsets/ascend-device-plugin --previous
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
[INFO]     2026/03/31 06:46:54.593254 1       hwlog/api.go:108    devicePlugin.log's logger init success
[INFO]     2026/03/31 06:46:54.593449 1       main.go:187    ascend device plugin starting and the version is v6.0.0_linux-aarch64
[INFO]     2026/03/31 06:46:54.593494 1       main.go:188    ascend device plugin starting scene is center
[INFO]     2026/03/31 06:46:54.787930 1       devmanager/devmanager.go:104    the dcmi version is 24.1.rc3
[ERROR]    2026/03/31 06:46:54.788019 1       devmanager/devmanager.go:211    get error card quantity: 0
[ERROR]    2026/03/31 06:46:54.788052 1       devmanager/devmanager.go:195    get card list failed for init
[ERROR]    2026/03/31 06:46:54.788101 1       main.go:203    init devmanager failed, err: auto init failed, err: get card list failed for init

Faulty pod: libdcmi check

[root@master1 ~]# kubectl -n kube-system exec -it daemonsets/ascend-device-plugin -- bash -c 'find /usr/local/Ascend/driver -name libdcmi.so 2>/dev/null; echo $LD_LIBRARY_PATH'
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
/usr/local/Ascend/driver/lib64/driver/libdcmi.so
command terminated with exit code 137
[root@master1 ~]# ps -ef | grep -E 'dmp_daemon|slogd' | grep -v grep
root       21578       1  0 Mar30 ?        00:00:19 /usr/sbin/rsyslogd -n -i/var/run/rsyslogd.pid

Check service status?

[root@master1 ~]# systemctl status ascend-dmi
Unit ascend-dmi.service could not be found.
[root@master1 ~]# systemctl status ascend-dkms
Unit ascend-dkms.service could not be found.
[root@master1 ~]# systemctl status npu-smi
Unit npu-smi.service could not be found.
[root@master1 ~]# find / -name dmp_daemon 2>/dev/null
[root@master1 ~]# find / -name slogd 2>/dev/null
[root@master1 ~]# ls -l /var/dmp_daemon /var/slogd 2>/dev/null
[root@master1 ~]# 

A dcmi problem; hardware-level troubleshooting is needed.
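Some host-side sanity checks can separate a dead card from a container misconfiguration (a sketch of my own; 19e5 is Huawei's PCI vendor ID, and npu-smi and the /dev/davinci* nodes come with the NPU driver; RUN=echo keeps it a dry run):

```shell
RUN=${RUN:-echo}   # dry-run by default; unset RUN to execute on the host
$RUN lspci -d 19e5:                            # does the kernel see the card on the PCI bus?
$RUN ls -l /dev/davinci0 /dev/davinci_manager  # device nodes created by the driver
$RUN npu-smi info                              # the driver's own view of the card
```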

Check card enumeration

#include <stdlib.h>
#include <stdio.h>
#include "dcmi_interface_api.h"

int my_get_card_list(void);

int main(int argc, char *argv[])
{
    my_get_card_list();
    return 0;
}

int my_get_card_list(void)
{
    printf("\n==================================card id info list=========================\n");
    dcmi_init();
    int card_num = 0;
    int card_list[16] = {0};
    int ret = dcmi_get_card_list(&card_num, card_list, 16);
    if (ret != DCMI_OK) {
        printf("dcmi get card list failed ret=%d\n", ret);
    }
    printf("card_num=%d, card_list:[", card_num);
    for (int i = 0; i < card_num; i++) {
        printf("%d ", card_list[i]);
    }
    printf("]\n");
    return ret;
}

cc ./test1.c -o test1 -I /usr/local/dcmi -L /usr/local/dcmi -ldcmi

-I: search path for header files (.h)
-L: search path for library files (.so/.a)
-l: name of the library to link (with the lib prefix dropped)

nerdctl run --rm \
  -v /usr/local/Ascend:/usr/local/Ascend \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v $(pwd):/build \
  ubuntu:18.04 bash -c "
    sed -i -e 's@http*://ports.ubuntu.com/\? @http://10.17.31.217:8081/repository/mirror-ubuntu-ports/@g' \
           -e 's@http*://ports.ubuntu.com@http://10.17.31.217:8081/repository/mirror-ubuntu-ports@g' \
           /etc/apt/sources.list
    apt update && apt install -y gcc
    cd /build
    cc ./test1.c -o test1 \
      -I /usr/local/dcmi \
      -L /usr/local/dcmi \
      -L /usr/local/Ascend/driver/lib64/common \
      -L /usr/local/Ascend/driver/lib64/driver \
      -ldcmi \
      -Wl,-rpath,/usr/local/Ascend/driver/lib64/common \
      -Wl,-rpath,/usr/local/Ascend/driver/lib64/driver \
      -Wl,-rpath,/usr/local/dcmi
  "

Analyzing the binary:

[root@master1 ascend_debug]# ldd ./test1 | grep -i dcmi
        libdcmi.so => /usr/local/Ascend/driver/lib64/driver/libdcmi.so (0x0000ffffa6dd0000)
[root@master1 ascend_debug]# LD_DEBUG=libs ./test1 2>&1 | grep -i dcmi
    284830:     find library=libdcmi.so [0]; searching
    284830:      search path=/usr/local/Ascend/driver/lib64/common/tls/aarch64/atomics:/usr/local/Ascend/driver/lib64/common/tls/aarch64:/usr/local/Ascend/driver/lib64/common/tls/atomics:/usr/local/Ascend/driver/lib64/common/tls:/usr/local/Ascend/driver/lib64/common/aarch64/atomics:/usr/local/Ascend/driver/lib64/common/aarch64:/usr/local/Ascend/driver/lib64/common/atomics:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver/tls/aarch64/atomics:/usr/local/Ascend/driver/lib64/driver/tls/aarch64:/usr/local/Ascend/driver/lib64/driver/tls/atomics:/usr/local/Ascend/driver/lib64/driver/tls:/usr/local/Ascend/driver/lib64/driver/aarch64/atomics:/usr/local/Ascend/driver/lib64/driver/aarch64:/usr/local/Ascend/driver/lib64/driver/atomics:/usr/local/Ascend/driver/lib64/driver:/usr/local/dcmi/tls/aarch64/atomics:/usr/local/dcmi/tls/aarch64:/usr/local/dcmi/tls/atomics:/usr/local/dcmi/tls:/usr/local/dcmi/aarch64/atomics:/usr/local/dcmi/aarch64:/usr/local/dcmi/atomics:/usr/local/dcmi            (RUNPATH from file ./test1)
    284830:       trying file=/usr/local/Ascend/driver/lib64/common/tls/aarch64/atomics/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/common/tls/aarch64/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/common/tls/atomics/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/common/tls/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/common/aarch64/atomics/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/common/aarch64/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/common/atomics/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/common/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/driver/tls/aarch64/atomics/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/driver/tls/aarch64/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/driver/tls/atomics/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/driver/tls/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/driver/aarch64/atomics/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/driver/aarch64/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/driver/atomics/libdcmi.so
    284830:       trying file=/usr/local/Ascend/driver/lib64/driver/libdcmi.so
    284830:      search path=/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/dcmi/tls/aarch64/atomics:/usr/local/dcmi/tls/aarch64:/usr/local/dcmi/tls/atomics:/usr/local/dcmi/tls:/usr/local/dcmi/aarch64/atomics:/usr/local/dcmi/aarch64:/usr/local/dcmi/atomics:/usr/local/dcmi          (RUNPATH from file ./test1)
    284830:       trying file=/usr/local/dcmi/tls/aarch64/atomics/libc.so.6
    284830:       trying file=/usr/local/dcmi/tls/aarch64/libc.so.6
    284830:       trying file=/usr/local/dcmi/tls/atomics/libc.so.6
    284830:       trying file=/usr/local/dcmi/tls/libc.so.6
    284830:       trying file=/usr/local/dcmi/aarch64/atomics/libc.so.6
    284830:       trying file=/usr/local/dcmi/aarch64/libc.so.6
    284830:       trying file=/usr/local/dcmi/atomics/libc.so.6
    284830:       trying file=/usr/local/dcmi/libc.so.6
    284830:     calling init: /usr/local/Ascend/driver/lib64/driver/libdcmi.so
    284830:     calling fini: /usr/local/Ascend/driver/lib64/driver/libdcmi.so [0]
[root@master1 ascend_debug]# readlink -f /usr/local/dcmi/libdcmi.so
/usr/local/dcmi/libdcmi.so
[root@master1 ascend_debug]# readlink -f /usr/local/Ascend/driver/lib64/driver/libdcmi.so
/usr/local/Ascend/driver/lib64/driver/libdcmi.so
[root@master1 ascend_debug]# sha256sum /usr/local/dcmi/libdcmi.so /usr/local/Ascend/driver/lib64/driver/libdcmi.so
13a38cae84bad0f06367ff9280016e372c0608ca16465b5ae5f000d3844ee401  /usr/local/dcmi/libdcmi.so
13a38cae84bad0f06367ff9280016e372c0608ca16465b5ae5f000d3844ee401  /usr/local/Ascend/driver/lib64/driver/libdcmi.so

Tracing with strace

Run on the host:
strace -f -o /tmp/host.strace -e trace=file,ioctl ./test1
Run in the container:
strace -f -o /tmp/container.strace -e trace=file,ioctl ./test1

Add the missing mounts and keep tracing:

volumeMounts:
        - name: hdc-basic
          mountPath: /etc/hdcBasic.cfg
          readOnly: true
        - name: localtime
          mountPath: /etc/localtime
          readOnly: true
      volumes:
      - name: hdc-basic
        hostPath:
          path: /etc/hdcBasic.cfg
          type: File
      - name: localtime
        hostPath:
          path: /etc/localtime
          type: File

Trace again and inspect the logs

kubectl -n kube-system exec -it ascend-device-plugin-69q5t -c device-plugin-01 -- bash

strace -f -o /tmp/container.strace -e trace=file,ioctl ./test1

root@ascend-device-plugin-69q5t:/tmp# strace -f -o /tmp/container.strace -e trace=file,ioctl ./test1

==================================card id info list=========================
card_num=0, card_list:[


root@ascend-device-plugin-69q5t:/tmp# cat /var/log/nputools_LOG_INFO.log > /tmp/nputools_LOG_INFO.log
root@ascend-device-plugin-69q5t:/tmp# cat /var/log/nputools_LOG_ERR.log > /tmp/nputools_LOG_ERR.log
cat: /var/log/nputools_LOG_ERR.log: No such file or directory
root@ascend-device-plugin-69q5t:/tmp# cat /tmp/nputools_LOG_INFO.log 
[2026/04/01 11:18:12][0583][root][127.0.0.1][dcmi_api.c,dcmi_board_init,86]:dcmi board init success. device_count=1.
[2026/04/01 11:18:12][0583][root][127.0.0.1][dcmi_api.c,dcmi_init,119]:dcmi init all success.

Test program 2

#include <stdio.h>
#include <stdlib.h>
#include "dcmi_interface_api.h"

#ifndef DCMI_OK
#define DCMI_OK 0
#endif

/* Not declared in the header; declare it manually */
extern int dcmi_get_card_num_list(int *card_num, int *card_list, int list_length);

static void print_list(const char *name, int ret, int num, int *list) {
    printf("%s ret=%d num=%d list=[", name, ret, num);
    for (int i = 0; i < num; ++i) {
        printf("%d ", list[i]);
    }
    printf("]\n");
}

int main(void) {
    int ret = dcmi_init();
    printf("dcmi_init ret=%d\n", ret);
    if (ret != DCMI_OK) {
        return 1;
    }

    int card_num = 0;
    int card_list[16] = {0};

    ret = dcmi_get_card_list(&card_num, card_list, 16);
    print_list("dcmi_get_card_list", ret, card_num, card_list);

    int card_num2 = 0;
    int card_list2[16] = {0};
    ret = dcmi_get_card_num_list(&card_num2, card_list2, 16);
    print_list("dcmi_get_card_num_list", ret, card_num2, card_list2);

    for (int i = 0; i < card_num && i < 16; ++i) {
        int dev_num = -1;
        ret = dcmi_get_device_num_in_card(card_list[i], &dev_num);
        printf("dcmi_get_device_num_in_card card=%d ret=%d dev_num=%d\n",
               card_list[i], ret, dev_num);
    }

    return 0;
}

Compile on the host:

cc ./test2.c -o test2 -I /usr/local/dcmi -L /usr/local/dcmi -ldcmi

Compile in the container:

nerdctl run --rm \
  -v /usr/local/Ascend:/usr/local/Ascend \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v $(pwd):/build \
  ubuntu:18.04 bash -c "
    sed -i -e 's@http*://ports.ubuntu.com/\? @http://10.17.31.217:8081/repository/mirror-ubuntu-ports/@g' \
           -e 's@http*://ports.ubuntu.com@http://10.17.31.217:8081/repository/mirror-ubuntu-ports@g' \
           /etc/apt/sources.list
    apt update && apt install -y gcc
    cd /build
    cc ./test2.c -o test2 \
      -I /usr/local/dcmi \
      -L /usr/local/dcmi \
      -L /usr/local/Ascend/driver/lib64/common \
      -L /usr/local/Ascend/driver/lib64/driver \
      -ldcmi \
      -Wl,-rpath,/usr/local/Ascend/driver/lib64/common \
      -Wl,-rpath,/usr/local/Ascend/driver/lib64/driver \
      -Wl,-rpath,/usr/local/dcmi
  "

Copy into the container and run:

kubectl -n kube-system cp ./test2 ascend-device-plugin-69q5t:/tmp/
# Run on the host
[root@master1 ascend_debug]# ./test2
dcmi_init ret=0
dcmi_get_card_list ret=0 num=1 list=[176 ]
dcmi_get_card_num_list ret=0 num=1 list=[176 ]
dcmi_get_device_num_in_card card=176 ret=0 dev_num=1

# Run in the container
root@ascend-device-plugin-69q5t:/tmp# ./test2 
dcmi_init ret=0
dcmi_get_card_list ret=0 num=0 list=[]
dcmi_get_card_num_list ret=0 num=0 list=[]

VM scenario

After careful debugging by Mr. Xu, the root cause turned out to be that this is not a bare-metal environment: the VM scenario requires a customized image.

According to the official documentation:

If Ascend Device Plugin is deployed in a VM scenario, systemd must be installed in the Ascend Device Plugin image; the recommended way is to add RUN apt-get update && apt-get install -y systemd to the Dockerfile.

To build the image with nerdctl, first install buildkit:

wget https://github.com/moby/buildkit/releases/download/v0.29.0/buildkit-v0.29.0.linux-arm64.tar.gz
tar zxvf buildkit-v0.29.0.linux-arm64.tar.gz
cp bin/* /usr/local/bin/

Then start buildkitd in a new terminal. It is only needed so that nerdctl can build images; skip it otherwise.

buildkitd --oci-worker=false --containerd-worker=true --containerd-worker-namespace=k8s.io 

The Dockerfile is as follows (adjust the mirror source as needed):
FROM hub.oepkgs.net/openfuyao/ascendhub/ascend-k8sdeviceplugin:v6.0.0

# Replace the apt mirror
RUN sed -i \
    -e 's@http*://ports.ubuntu.com/\? @http://10.17.31.217:8081/repository/mirror-ubuntu-ports/@g' \
    -e 's@http*://ports.ubuntu.com@http://10.17.31.217:8081/repository/mirror-ubuntu-ports@g' \
    /etc/apt/sources.list

# Install systemd
RUN apt-get update && \
    apt-get install -y --no-install-recommends systemd systemd-sysv && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Tell the runtime how to stop a systemd container
STOPSIGNAL SIGRTMIN+3

Note: STOPSIGNAL SIGRTMIN+3 tells the container runtime (containerd/docker) which signal to send when stopping the container. If the container does not run systemd as PID 1 (e.g. the entrypoint is a regular program), this line has no effect and can be dropped. If systemd does manage the services inside the container, keeping it avoids the 30-second grace-period timeout (followed by a force kill) on kubectl delete pod.

Build command:

nerdctl build \
  --namespace k8s.io \
  -t hub.oepkgs.net/openfuyao/ascendhub/ascend-k8sdeviceplugin:v6.0.0-systemd \
  -f Dockerfile \
  .

Then replace the problematic image with the newly built one.

npu-operator has the same issue; the same fix applies.

Verifying the fix

The fix is confirmed once the node reports the NPU resource:
[root@master1 ~]# kubectl  describe node master1 
Name:               master1
Roles:              control-plane,master,node,worker
Labels:             accelerator=huawei-Ascend310P
                    beta.kubernetes.io/arch=arm64
                    beta.kubernetes.io/os=linux
...
                    servertype=Ascend310P-8
                    workerselector=dls-worker-node
Annotations:        baseDeviceInfos: {"Ascend310P-0":{"IP":"","SuperDeviceID":0}}
...
Capacity:
  cpu:                    16
  ephemeral-storage:      129724184Ki
  huawei.com/Ascend310P:  1
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  hugepages-32Mi:         0
  hugepages-64Ki:         0
  memory:                 32595632Ki
  pods:                   110
Allocatable:
  cpu:                    16
  ephemeral-storage:      119553807777
  huawei.com/Ascend310P:  1
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  hugepages-32Mi:         0
  hugepages-64Ki:         0
  memory:                 32493232Ki
  pods:                   110
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource               Requests       Limits
  --------               --------       ------
  cpu                    15565m (97%)   16910m (105%)
  memory                 17492Mi (55%)  30900Mi (97%)
  ephemeral-storage      0 (0%)         0 (0%)
  hugepages-1Gi          0 (0%)         0 (0%)
  hugepages-2Mi          0 (0%)         0 (0%)
  hugepages-32Mi         0 (0%)         0 (0%)
  hugepages-64Ki         0 (0%)         0 (0%)
  huawei.com/Ascend310P  0              0
...

Disable the global menu in KDE Plasma 6 and restore the normal application menu

By tl.s
2026-04-01 13:55

Background

At some point, KDE Plasma started enabling a macOS-style global application menu by default.

That is, the menu no longer appears below the window title bar; it moves into the "Global Menu" widget in the top panel.

The problem is that the Linux desktop ecosystem is complex — X11, Wayland, Qt, GTK and so on — so there is no guarantee that every common application renders the global menu correctly.

For example, I recently found that Joplin offers no entry point at all outside the menu bar.

After some searching, I finally found how to disable the global menu and restore the normal application menu.

How to restore it

Step 1: remove the Global Menu widget

Edit Mode > Add or Manage Widgets > Global Menu > Remove all instances (button in the top right corner of the widget)

Enter edit mode, open "Add or Manage Widgets", and remove the "Global Menu" widget.

image

Step 2: remove the application menu button

In practice Step 1 is enough; do Step 2 only if your setup needs it.

Settings > Colors & Themes > Window Decorations > Configure Titlebar Buttons…

There remove the Application Menu ("Hamburger") button from your titlebars.

Open Settings > Colors & Themes > Window Decorations > Configure Titlebar Buttons, remove the "Application Menu" button from the titlebar, and click Apply.

image

Step 3: restart the application

The application menu should now be visible; if not, restart the application.

image

Refs

The ultimate guide: quickly deploy Moltbot (formerly Clawbot) on a bare-metal Linux server and integrate Feishu

By tl.s
2026-01-29 20:00

Introduction

In early 2026, an open-source AI agent framework named Moltbot (formerly Clawbot) swept through the developer community. The framework lets users wire powerful AI models (such as OpenAI's GPT series or Anthropic's Claude) into everyday messaging tools like WhatsApp, Telegram and Discord, so that chatting is all it takes to control a computer, run tasks and fetch information. An endorsement from former Tesla AI director Andrej Karpathy made it go viral, and its GitHub project collected more than 60,000 stars in a short time.

This guide is a security-first, step-by-step tutorial for deploying Moltbot safely on a bare-metal Linux server.

Prerequisites

Before starting, make sure you have the following:

  • Linux server: Ubuntu 22.04+ or another modern distribution; at least 2 GB RAM and 2 CPU cores recommended.
  • Basic Linux knowledge: familiarity with SSH and the command line; every command you need is provided in this tutorial.

Step 1: prepare the server environment

Following security best practice, we strongly recommend never running applications directly as root. So the first step is to create a dedicated unprivileged user and grant it sudo.

  1. SSH into your server as root:
ssh root@your_server_ip
  2. Create a new user (we will call it moltbot):
adduser moltbot

You will be prompted to set a password and other details for the new user.

  3. Add the new user to the sudo group so it can perform administrative operations:
usermod -aG sudo moltbot
  4. Switch to the new user:
su - moltbot
  5. Update the system packages:
sudo apt update && sudo apt upgrade -y

From here on, run everything as the moltbot user.

Step 2: install Moltbot

Moltbot's core dependency is Node.js (version ≥ 22). The project ships a convenient one-line installer that handles the dependencies and installs Moltbot.

  1. Run the official install script:
curl -fsSL https://molt.bot/install.sh | bash

The script detects your OS, installs a suitable Node.js version, and deploys the Moltbot command-line tool (CLI).

After installation it drops you into an interactive setup flow, as shown below:

image

The first prompt is a risk warning; type yes to confirm.

At the mode selection step, QuickStart is recommended for a fast setup.

Step 3: configure Qwen

For the model provider, this tutorial uses Alibaba Cloud's Qwen as the example: it offers a generous free quota and is great for getting started. Once you are familiar with the system, switch to another model as needed.

After selecting Qwen, the terminal prints an authorization link; copy it and open it in a browser.

The authorization page appears; log in to your Alibaba Cloud account if you are not already signed in.

After logging in, you are asked to choose a specific Qwen model version.

The default model is fine here. The wizard then asks whether to configure a channel; skip it for now and configure it later.

Likewise, answer No at the skills configuration step.

Wait for the installer to finish, and at the final launch-mode prompt choose TUI (text user interface).

If the TUI chat screen opens, Moltbot is installed correctly. Type Hello for a first test.

Once the test passes, press Ctrl+C to exit the TUI so we can set up the background service.

Step 4: configure Moltbot and start the Gateway

  1. Check the Gateway service status

The installer registers the Gateway as a system service and starts it in the background.

clawdbot gateway status

If the service is running, you will see active (running).

Service status

If the status shows a problem (e.g. inactive or failed), try the built-in diagnostic tool:

clawdbot doctor --repair

Check again after the repair; the service should be back to normal.

Service status

  2. Configuration options

Edit Moltbot's main configuration file as needed.

Open it with a text editor (such as vim or nano):

sudo vim ~/.clawdbot/clawdbot.json
  3. Reload and restart the service

After every configuration change, reload the service for it to take effect. Some Moltbot versions support auto-reload, but a manual restart is more reliable.

systemctl --user daemon-reload
clawdbot gateway restart

At this point the Moltbot Gateway is running in the background, listening on a local port and waiting for connections.

Step 5: access the dashboard

Moltbot ships with a Web UI for management and testing. Run the following to get the access instructions:

clawdbot dashboard

The command prints an SSH port-forwarding command of this form:

Dashboard access instructions

On your local machine (not the server), open a terminal and run that SSH command. It maps the server's Web UI port 18789 to the same port locally.

ssh -N -L 18789:127.0.0.1:18789 xxx@xxx.xxx.xxx.xxx -p xxx

With the tunnel up, open http://127.0.0.1:18789 in your local browser to reach the Web UI.

Send a Hello message to test; if you get a reply, everything works.

Dashboard

Step 6: connect Feishu

First, install the Feishu plugin for Moltbot on the server:

clawdbot plugins install @m1heng-clawd/feishu

Next, log in to the Feishu Open Platform, go to the Developer Console, and click "Create custom app".

Fill in the app name and description to finish creating it.

Then open the app's "Credentials & Basic Info" page and copy the App ID and App Secret somewhere safe; you will need them shortly.

Then add a bot to the app.

Leave the rest of the Feishu console setup for a moment and return to the server terminal to finish Moltbot's Feishu configuration.

Add the Feishu settings:

clawdbot config set channels.feishu.appId "your Feishu App ID"

clawdbot config set channels.feishu.appSecret "your Feishu App Secret"

clawdbot config set channels.feishu.enabled true

# websocket is the recommended mode
clawdbot config set channels.feishu.connectionMode websocket

clawdbot config set channels.feishu.dmPolicy pairing

clawdbot config set channels.feishu.groupPolicy allowlist

clawdbot config set channels.feishu.requireMention true

When the configuration is done, restart:

clawdbot gateway restart

After the restart, go back to Feishu, open "Events & Callbacks", and choose long-connection mode, as shown below.

If it is configured correctly, the connection is established. Continue by adding the "Receive messages" event.

After the event is added, grant the required permissions — check all of the following:

  • contact:user.base:readonly — user info — read basic user information
  • im:message — messages (check all) — send and receive messages

As shown below.

Once all of the above is done you can talk to the bot, but first you must create a release version.

Note: every configuration change requires publishing a new version; it is easier to finish all configuration and publish once.

After publishing, go back to the Feishu client; the app is live. Open it.

Send Hello to the bot and you will get Moltbot's reply.

Congratulations! You have deployed Moltbot on a Linux server and integrated it with Feishu. Explore its features, or customize further with the official docs. If you hit any problems during deployment, feel free to leave a comment.

Useful commands

# Check for common security issues
clawdbot security audit

# Auto-fix file permissions and similar issues
clawdbot security audit --fix

# Follow Moltbot logs
clawdbot logs --follow

References

Configuring Claude Code on Windows: fixing settings.json not taking effect

By tl.s
2026-01-09 16:28

TL;DR

By default, Claude Code installed on Windows reads its configuration from:

C:\Users\<YOUR_NAME>\.claude

On other systems, look for the equivalent ~/.claude path.

The official flow is complete right after installation and you can log in and use it directly. But if you want to point Claude Code at a third-party API, you need to edit the settings.json configuration file there (either with CC-Switch or by hand) — and then discover that the change has no effect.

It turns out a recent Claude Code release changed a flag that forces login on startup. Manually edit this key in C:\Users\<YOUR_NAME>\.claude.json and it works again:

hasCompletedOnboarding: true

Change the value from false to true, then launch Claude Code again.

This problem cost me two days — I nearly gave up and went back to another OS — until I finally found the answer here, hence this note.
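If you need to apply this fix repeatedly (e.g. on several cloud desktops), the one-key edit is easy to script. A minimal sketch, assuming the file is plain JSON at ~/.claude.json; `complete_onboarding` is a hypothetical helper name, and the demo below operates on a temporary copy rather than the real config:

```python
# Hypothetical helper: set hasCompletedOnboarding=true in a .claude.json-style file.
import json
from pathlib import Path

def complete_onboarding(config_path: Path) -> bool:
    """Flip the onboarding flag in place and return the new value."""
    data = json.loads(config_path.read_text(encoding="utf-8"))
    data["hasCompletedOnboarding"] = True
    config_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return data["hasCompletedOnboarding"]

# Demo on a temporary copy instead of the real ~/.claude.json
demo = Path("claude-demo.json")
demo.write_text(json.dumps({"hasCompletedOnboarding": False}), encoding="utf-8")
print(complete_onboarding(demo))  # True
demo.unlink()
```

For the real file, point it at `Path.home() / ".claude.json"` after backing the file up.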

Refs

The full process of configuring Claude Code on Windows

By tl.s
2026-01-09 16:22

Today I finally got Claude Code running natively on Windows, end to end, without WSL. The motivation: I need a cloud desktop that can run long-lived tasks, and Windows is still the best at that. Compared with Linux/macOS, running Claude Code on Windows is full of pitfalls.

Thanks to the LD guru Halley's Comet and the "Claude Code Ultimate FAQ Guide"; with that guide, plus help from friends in the open-source community, the whole flow finally works. I am writing it down here so fewer people fall into the same holes.

ClaudeCode

Environment

My environment:

  • OS: Windows 10 Enterprise LTSC 2021
  • Shell: PowerShell 5 -> PowerShell 7

I started from a fresh Windows 10 Enterprise LTSC 2021 install, using the built-in PowerShell 5. The flow below installs the latest PowerShell 7 and uses PS7 for everything afterwards. According to the forum gurus, PS7 is much nicer than the built-in PS5, so the steps include installing PS7 — try it for yourself.

Main flow

The following mostly comes from the "Claude Code Ultimate FAQ Guide", plus notes from practice.

Open PowerShell 5 and install WinGet:

$progressPreference = 'silentlyContinue'
Install-PackageProvider -Name NuGet -Force | Out-Null
Install-Module -Name Microsoft.WinGet.Client -Force -Repository PSGallery | Out-Null
Write-Host "Using Repair-WinGetPackageManager cmdlet to bootstrap WinGet..."
Repair-WinGetPackageManager -AllUsers

In PowerShell 5, use WinGet to install PowerShell 7:

winget install Microsoft.PowerShell

Open PowerShell 7 and install the remaining dependencies with winget; everything from here on happens in PS7:


# Required: install fnm to manage the node environment (or use an existing one)
winget install Schniz.fnm
# Optional: install Git
winget install --id Git.Git -e --source winget
# Optional: install notepad4
winget install zufuliu.notepad4
# Optional: install Windows Terminal
winget install -e --id Microsoft.WindowsTerminal
# Optional: install Notepad++
winget install -e --id Notepad++.Notepad++

Next, use fnm to install a specific node runtime. fnm first needs its environment bootstrapped, otherwise the installed node cannot be used, so pre-configure the profile:

New-Item -Path $Profile -Type File -Force
notepad $profile

Then write the following into that file and save it:

fnm env --use-on-cd --shell powershell | Out-String | Invoke-Expression

Now install nodejs:

fnm install lts/krypton
fnm use lts/krypton

Install Claude Code globally:

npm install -g @anthropic-ai/claude-code

After that, enjoy Claude Code.

Claude Code running in Windows PowerShell 7

Common problems

settings.json changes have no effect

By default, a Claude Code installed this way reads its configuration from:

C:\Users\<YOUR_NAME>\.claude

The official flow is complete right after installation and you can log in and use it directly. If you want to point Claude Code at a third-party API, you need to edit the settings.json configuration file there (with CC-Switch or by hand) — and then discover that the change has no effect.

It turns out a recent Claude Code release changed a flag that forces login on startup. Manually edit this key in C:\Users\<YOUR_NAME>\.claude.json and it works again:

hasCompletedOnboarding: true

Change the value from false to true, then launch Claude Code again.

This problem cost me two days — I nearly gave up and went back to another OS — until I finally found the answer here, hence this note.

Refs

2025-12-31 | Year-end review

By tl.s
2026-01-01 01:49

And so 2025 is over. I did a lot, and a lot happened; it was a year of turning points, a memorable year, a year worth savoring.

This year my life's trajectory changed a great deal, and so did my thinking. Asked to suddenly write a year-end review, I hardly know where to begin — so I will just write wherever my thoughts go.

2025 was a year of exploring an enormous number of things. AI kept exploding, and industry after industry kept being reshaped. At the start of the year, "AI Agent" was every outlet's predicted word of the year; at the end, Manus was acquired by Meta — perhaps the AI agent's moment in the spotlight. Meanwhile AI agents seem to be profoundly changing every field.

For me personally, I kept exploring, but perhaps my horizons and abilities are still limited, so the value I could produce was limited. All I can do is keep trying, and keep going.

Git commits

My GitHub commit count this year is quite respectable, because I kept experimenting in my spare time.

git-wrapped-songtianlun

Compared with last year, commits increased enormously — and that is only GitHub; commits to the work intranet are probably several times that.

My own forgejo server also shows a fair number of commits.

image-20260101010450350

Of course it is still far behind the big names, but compared with my own last year it is a big increase, and I am happy with that.

On value

Value creation. My understanding of the word "value" has deepened, especially after hearing an interview this year with Liu Pingyun, creator of the Greater Bay Area Games mascot, on how a mascot delivers value; the word suddenly felt concrete to me.

In an interview with Manus co-founder and chief scientist Ji Yichao (Peak), he mentioned loving to tinker with computers in high school. Jobs had just launched the App Store, and his software sold steadily there, generating cash flow — enough to prove to his parents and teachers that what he was doing had value, so they let him run with it.

I keep asking myself how I can create, or prove, greater value. People better than I am figured this question out earlier; being a bit slower than them is fine, but the thinking has to start.

On learning

In 2025 I learned a great deal, almost as if driven by learning anxiety, and applied it in all sorts of places.

Yet somehow it never feels deep enough; the value created is limited, while I keep exploring more possibilities.

截屏 2026-01-01 01.11.51

I read fewer books on WeChat Reading, because I spent more time on paper books, on writing, and on exploratory projects. My rule is to buy about one paper book a month and finish it before buying the next. I bought a few this year and hope to read more next year.

Beyond reading, I studied a lot about technology, promotion and more, ranging widely. I do not want to stay boxed into the technical domain; I want to break out and make what I build genuinely valuable to more people, and that takes constant experimentation. Besides working hard to master the necessary domain knowledge as deeply as possible, I also need to learn growth, marketing, operations and the thinking that goes with them. I hope for a breakthrough in the new year.

On writing

Lately I feel I am at a low point in output, tangled up in hesitation and confusion.

I lost the urge to update the blog: everything feels shallow, yet going deep makes the direction hard to grasp, and I do not know where to push. It feels like nobody would care, and I keep mulling it over. At the same time, watching visitors keep coming in, I try to keep the blog alive.

Besides the blog, I tried a daily-update WeChat official account for a while: capture stray ideas in flomo, spend 5-10 minutes a day batch-exporting them, lightly format and publish. I kept it up for a few weeks and quit — the pressure was too much and felt harmful to my health, so I paused it.

The key problems are still direction and method, and I need to keep thinking and experimenting.

Technical exploration

I also put a lot of effort into technical exploration this year, from AI to containers, from compute infrastructure to business applications, trying to make use of everything within reach. On the container side I contributed a bit to open-source communities and learned a lot while coding; I hope to continue next year. On the AI side I genuinely do not know where to push: it feels a little speculative, and going deep means picking a profitable niche, which is hard. Given my personal situation I dare not bet on an unknown field; I need steady progress.

I studied plenty of AI topics, with limited results; the most direct output is probably a collection of prompts and usage notes. Building with AI, I found I hit my own ceiling faster: things do start quickly, but you also have to learn quickly, find the direction and the method, and repeat until you get results.

Side projects

In my spare time I built a lot — 12+ projects, all experiments. Some had a poor tech stack, some were too inefficient, some were just for my own amusement, some were unstable, some had bad SEO. Much of what I learned only sank in through building; even things that seem trivially simple have to be done with your own hands once before you truly understand them. Drawing conclusions without practice easily leaves you merely believing you understand.

The blog

As for the blog, Baidu Analytics shows just under 100k PV and about 60k UV over these years.

image-20260101012740657

Most traffic is direct; only 15% comes from search engines, which suggests real problems with SEO and content direction. But it is a personal blog — no matter; I will keep writing whatever I want, and keep topic-specific material on dedicated topic sites.

image-20260101012843988

The pageview numbers seem a bit off, though: my own Plausible + GA4 show 81k UV and 108k PV for 2025.

image-20260101013135126

There, nearly half the traffic comes from search engines — does Baidu Analytics not count traffic from other search engines?

image-20260101013224874

Visits come mostly from China, with Chrome the most common browser:

image-20260101013337511

On the urge to share

I think about a great many things, but my sharing comes in bursts; I waver on direction and my energy is scattered. That is a personal problem I want to gradually change: notice the good things around me, focus on fewer things, and share a bit more. What I explore may genuinely be valuable to others. In particular I should dig into forums like Reddit, improving my English along the way.

Health

I visited the hospital quite a few times this year. Minor ailments, all of them, but still miserable; even a tiny discomfort can badly drain my energy. At this age, much of the discomfort is a reminder to rest. For some issues, the doctor's closing remark was simply: you may need a long vacation.

So: take things more lightly, do not sweat every detail, do not let emotions swing too much. A strong sense of responsibility and doing things seriously are my strengths, but letting unworthy things damage my health would be a terrible trade. Limited attention should go to more valuable things.

Highlight of the year

The highlight of the year was probably being selected as a torchbearer for the 15th National Games of the People's Republic of China. I took part in the torch relay and appeared on CCTV — the anchor read my name and there was a dedicated shot. It was a group honor, but it is probably something I can brag about for the rest of my life. Most of those selected alongside me were entrepreneurs and highly influential people from all walks of life. Still, a reminder to myself: keep a low profile as a person and aim high in what I do — loud when it should be loud, quiet when it should be quiet. I hope things keep getting better, and that I can do more valuable things.

On confidence

This may be my biggest flaw: a lack of confidence. A big factor in a luckier, better life is confidence — ideally even "blind" confidence, confidence without a reason. I am already ahead of many people in many respects; what I need is to move forward with more confidence, chase bigger goals, and create more value. So in the new year, I will remind myself to be more and more confident.

Closing

In the new year, keep working hard. May I be healthy and happy, may the people I love be healthy and happy too, may everyone be safe and joyful, and may everything go well.

Premium AI image prompts | Issue 2: City Planets

By tl.s
2025-12-13 15:04
Unless otherwise noted, images are generated with the Nano Banana Pro model provided by AiLoft.

City Planet series cover image

This issue brings the "City Planet" series. Results first:

Refs: https://x.com/TechieBySA/status/1999577563295826208

Guangzhou city planet

Beijing city planet

Xiamen city planet

Hong Kong city planet

Chicago city planet

Macau city planet

The prompt:

Create a hyperrealistic miniature planet showcasing [GuangZhou] with famous landmarks seamlessly curving around the spherical surface. Position bold 3D white text ”[CITY]” naturally integrated across the lush green central parkland with realistic shadows and dimensional depth. Capture from a top-down orbiting angle that emphasizes the dramatic planet curvature. Use soft golden hour daylight filtering through partly cloudy skies, casting gentle shadows on emerald grass and surrounding trees. The background should blend into a swirling atmospheric sky. Apply vibrant greens, warm earth tones, and soft blues. Render in polished photorealistic style with fine architectural detail.

You can replace GuangZhou in the example with another city, for example:

[GuangZhou]
[BeiJing]
[XiaMen]
[ZhengZhou]
[HongKong]
[Chicago]
[Macau]

They all turn out great — go try it with AiLoft.

Premium AI image prompts | Issue 1

By tl.s
2025-12-06 14:47
Unless otherwise noted, images are generated with the Nano Banana Pro model provided by AiLoft.

Cover images for this issue:

English cover image

Chinese cover image

3D chibi miniature concept store

Refs: https://x.com/dotey/status/1995190286775881780

Notes:

🍌 nano banana prompt
3D chibi-style miniature concept store of {Brand Name}

Prompt:

3D chibi-style miniature concept store of {Brand Name}, creatively designed with an exterior inspired by the brand's most iconic product or packaging (such as a giant {brand's core product, e.g., chicken bucket/hamburger/donut/roast duck}). The store features two floors with large glass windows clearly showcasing the cozy and finely decorated interior: {brand's primary color}-themed decor, warm lighting, and busy staff dressed in outfits matching the brand. Adorable tiny figures stroll or sit along the street, surrounded by benches, street lamps, and potted plants, creating a charming urban scene. Rendered in a miniature cityscape style using Cinema 4D, with a blind-box toy aesthetic, rich in details and realism, and bathed in soft lighting that evokes a relaxing afternoon atmosphere. --ar 2:3

Brand name: Starbucks

Results:
Starbucks
3D chibi-style miniature concept store of Starbucks

麦当劳
3D chibi-style miniature concept store of 麦当劳

McDonald
3D chibi-style miniature concept store of McDonald

Selfie with a star

refs: https://x.com/canghecode/status/1996593241421181403

Prompt:

"I'm taking a selfie with [movie character] on the set of [movie name].

Keep the person fully consistent with the reference image: facial features, bone structure, skin tone, expression, pose and appearance 100% identical. 1:1 aspect ratio, 4K detail."

Example:

Reference image:

Harry Potter

I'm taking a selfie with [Judy Hopps] and [Nick Wilde] on the set of [Zootopia].

Keep the person fully consistent with the reference image: facial features, bone structure, skin tone, expression, pose and appearance 100% identical. 1:1 aspect ratio, 4K detail.

Result:

Harry Potter taking a selfie with the stars on the Zootopia set

3D miniature movie scene

refs: https://x.com/tsubame_99/status/1997181713546727621
Present an exquisite, miniature 3D cartoon-style scene of the user-specified subject, clearly viewed from a 45° top-down perspective.

Place the subject’s most iconic representation, character, or landmark prominently at the center, complemented by proportionally-sized icons of key elements, symbolic items, charming figures, props, and other details illustrating the subject’s core theme or narrative. The scene should be detailed, finely crafted, and playful.

Rendered with Cinema 4D, the modeling should be refined, smoothly rounded, and rich in texture, accurately capturing realistic PBR materials. Gentle, lifelike lighting and soft shadows should create a warm, comfortable ambiance.

Maintain a clean, minimalist layout and a solid-color background to highlight the primary content.

At the top-center of the scene, prominently display the subject name in a large font size.

All texts should be displayed in the language specified or entered by the user, without any background, and may subtly overlap with the scene elements to enhance overall design integration.

**Parameters:**
Aspect ratio: {User input, default 1:1}
Subject Name / Topic: {User input}

Examples

Subject Name / Topic: Squid Game
Subject Name / Topic: 名侦探柯南

Result:

名侦探柯南 (Detective Conan) 3D miniature movie scene

3D dynamic city weather scene

refs: https://x.com/dotey/status/1993729800922341810?s=46

Notes:

🍌 nano banana pro prompt (with gemini)

Dynamically generate a current weather card based on a given city name.

Prompt:

Present a clear, 45° top-down view of a vertical (9:16) isometric miniature 3D cartoon scene, highlighting iconic landmarks centered in the composition to showcase precise and delicate modeling.

The scene features soft, refined textures with realistic PBR materials and gentle, lifelike lighting and shadow effects. Weather elements are creatively integrated into the urban architecture, establishing a dynamic interaction between the city's landscape and atmospheric conditions, creating an immersive weather ambiance.

Use a clean, unified composition with minimalistic aesthetics and a soft, solid-colored background that highlights the main content. The overall visual style is fresh and soothing.

Display a prominent weather icon at the top-center, with the date (x-small text) and temperature range (medium text) beneath it. The city name (large text) is positioned directly above the weather icon. The weather information has no background and can subtly overlap with the buildings.

The text should match the input city's native language.
Please retrieve current weather conditions for the specified city before rendering.

City name:【广州】

Result:

Guangzhou dynamic city weather scene

3D classical Chinese poem scene

Refs: https://x.com/ykszs017/status/1997268159800434783
A vertical (9:16), intricately detailed 45° top-down isometric miniature 3D cartoon scene. The central composition is a richly composed and densely modeled visualization of an ancient Chinese poem's imagery. The scene features soft but highly tactile PBR textures with matte materials (like hand-sculpted clay with visible tool marks, aged wood grain, rough ceramic, and textured fabrics). Layered, lyrical lighting with soft shadows and ambient occlusion emphasizes the depth of every small feature. Key visual elements derived from the verses are creatively integrated into the landscape. The scene must be densely filled with micro-details to create a lived-in, immersive poetic ambiance: add scattered pebbles, moss in cracks, fallen leaves, dew drops on plants, intricate architectural carvings, tiny wildlife (like birds or fish), and environmental clutter relevant to the poem. Use a clean, unified composition that balances richness with aesthetics, set against a soft, solid-colored background that complements the scene's color palette. The overall visual style is fresh, soothing, yet elaborate and full of life. Text layout: At the top-center, display the poem title in large, elegant Chinese font. Directly beneath it, display the Dynasty and Author in medium font. The core verses of the poem are arranged below the main scene or subtly floating near relevant objects within the composition, with no background plate. Poem Title: 【天净沙·秋思】 Dynasty/Author: 【元·马致远】 Core Verses Context: 枯藤老树昏鸦,小桥流水人家,古道西风瘦马。夕阳西下,断肠人在天涯 。(Visual Focus: An old tree, tightly entwined with withered vines, has cracked branches, and a few crows are perched or flapping their wings. A small stone bridge spanning a cold and clear stream, with mist and moisture clinging to the water surface; Two or three compact houses by the stream, with very faint smoke from their chimneys, were torn apart by the cold wind. An ancient path, dusty by the autumn wind, extends deeper into the picture. 
A bony and thin horse, with dull fur, had its mane flapped sideways by the wind. A solitary traveler, wearing an old cloak, with a slender figure, strolled slowly in the wind. The withered and yellow grass lay on the ground, and the cracked dirt road had ruts. The setting sun was close to the horizon, its golden-red rays shattered by the wind, casting long shadows on the travelers, horses and ancient paths. The overall scene presents a desolate, dry and cold, and spacious atmosphere of late autumn.)

Result:

天净沙·秋思

A guide to kubectl --raw in Kubernetes

By tl.s
2025-11-10 21:08

What is kubectl --raw?

kubectl --raw (more precisely, the --raw flag on kubectl get/create/replace/delete) is a powerful low-level tool that sends a request straight to a path on the Kubernetes API Server's REST API, bypassing kubectl's client-side logic (resource parsing, validation, printing). Note that the request still goes through the API server, so authentication, authorization and admission (including webhooks) still apply on the server side.

Why --raw?

The standard kubectl request flow

kubectl command
    ↓
client-side validation and processing
    ↓
API Server (authn/authz)
    ↓
Admission Controllers
    ↓
Mutating Webhooks (mutate the request)
    ↓
Validating Webhooks (validate the request)
    ↓
API Server persists to etcd

The kubectl --raw request flow

kubectl --raw
    ↓
HTTP request sent directly to an API Server path
    ↓
skips kubectl's client-side processing
    ↓
API Server (authn/authz/admission) persists to etcd

Use cases

  1. Debugging the API Server - diagnosing interaction problems between the kubectl client and the API Server
  2. Accessing special endpoints - non-resource endpoints such as metrics, healthz
  3. Working around client limitations - when your kubectl version does not support a newer feature
  4. Performance testing - measuring API Server response time directly
  5. Repairing stuck resources - e.g. writing a status subresource directly to clean up state left behind by a controller

Basic syntax

# Basic form: --raw is a flag on kubectl verbs
kubectl get --raw <API-PATH>

# The write verbs accept it too
kubectl create --raw <API-PATH> -f file.json
kubectl replace --raw <API-PATH> -f file.json
kubectl delete --raw <API-PATH>

Common operations

1. GET - query resources

Cluster-scoped resources

# Get all nodes
kubectl get --raw /api/v1/nodes | jq .

# Get a specific node
kubectl get --raw /api/v1/nodes/node-name | jq .

# Get a node's status
kubectl get --raw /api/v1/nodes/node-name/status | jq .

# Get all namespaces
kubectl get --raw /api/v1/namespaces | jq .

Namespaced resources

# Get all Pods in the default namespace
kubectl get --raw /api/v1/namespaces/default/pods | jq .

# Get a specific Pod
kubectl get --raw /api/v1/namespaces/default/pods/pod-name | jq .

# Get a Deployment
kubectl get --raw /apis/apps/v1/namespaces/default/deployments/deploy-name | jq .

# Get a Service
kubectl get --raw /api/v1/namespaces/default/services/svc-name | jq .

Subresources

# Pod logs
kubectl get --raw /api/v1/namespaces/default/pods/pod-name/log

# Pod status
kubectl get --raw /api/v1/namespaces/default/pods/pod-name/status | jq .

# Endpoints of a Service
kubectl get --raw /api/v1/namespaces/default/endpoints/service-name | jq .

2. PUT - replace a resource

# Update a node (fetch, edit, then replace)
kubectl get --raw /api/v1/nodes/node-name > node.json

# Edit node.json
vim node.json

# Replace (note: syntax may differ between kubectl versions)
kubectl replace --raw /api/v1/nodes/node-name -f node.json

# Or go through kubectl proxy
kubectl proxy --port=8001 &
curl -X PUT \
  -H "Content-Type: application/json" \
  -d @node.json \
  http://localhost:8001/api/v1/nodes/node-name

Hands-on: clearing a zombie node condition

# Fetch the node's current state
kubectl get --raw /api/v1/nodes/node-name > /tmp/node.json

# Remove the offending condition with jq
jq 'del(.status.conditions[] | select(.type == "EtcdIsVoter"))' \
  /tmp/node.json > /tmp/node-fixed.json

# Update the node's status subresource
kubectl replace --raw /api/v1/nodes/node-name/status -f /tmp/node-fixed.json
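The jq step above can also be sketched in Python, which is handy when jq is unavailable. A minimal, self-contained illustration on inline sample data (`drop_condition` is a name chosen here; the EtcdIsVoter condition type is taken from the command above):

```python
import json

# Sample node document, standing in for /tmp/node.json
node = {
    "status": {
        "conditions": [
            {"type": "Ready", "status": "True"},
            {"type": "EtcdIsVoter", "status": "True"},  # the zombie condition
        ]
    }
}

def drop_condition(node: dict, cond_type: str) -> dict:
    """Equivalent of jq 'del(.status.conditions[] | select(.type == ...))'."""
    conds = node.get("status", {}).get("conditions", [])
    node["status"]["conditions"] = [c for c in conds if c["type"] != cond_type]
    return node

fixed = drop_condition(node, "EtcdIsVoter")
print(json.dumps([c["type"] for c in fixed["status"]["conditions"]]))  # ["Ready"]
```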

3. POST - create a resource

# Create a Pod
cat > pod.json <<EOF
{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
    "name": "test-pod",
    "namespace": "default"
  },
  "spec": {
    "containers": [{
      "name": "nginx",
      "image": "nginx:latest"
    }]
  }
}
EOF

kubectl create --raw /api/v1/namespaces/default/pods -f pod.json

4. DELETE - delete a resource

# Delete a Pod
kubectl delete --raw /api/v1/namespaces/default/pods/pod-name

# Or go through kubectl proxy
kubectl proxy --port=8001 &
curl -X DELETE http://localhost:8001/api/v1/namespaces/default/pods/pod-name

5. PATCH - partial updates

kubectl patch does not take a --raw flag, so send raw PATCH requests through kubectl proxy and select the patch flavor via the Content-Type header:

kubectl proxy --port=8001 &

# JSON Patch (explicit operations)
curl -X PATCH \
  -H "Content-Type: application/json-patch+json" \
  -d '[
    {"op": "add", "path": "/metadata/labels/new-label", "value": "new-value"}
  ]' \
  http://localhost:8001/api/v1/nodes/node-name

# Strategic Merge Patch (Kubernetes-aware merge)
curl -X PATCH \
  -H "Content-Type: application/strategic-merge-patch+json" \
  -d '{
    "metadata": {
      "labels": {
        "environment": "production"
      }
    }
  }' \
  http://localhost:8001/api/v1/nodes/node-name

# Merge Patch (plain RFC 7386 merge)
curl -X PATCH \
  -H "Content-Type: application/merge-patch+json" \
  -d '{"spec":{"unschedulable":true}}' \
  http://localhost:8001/api/v1/nodes/node-name
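For intuition, merge-patch semantics (RFC 7386) are simple enough to sketch in a few lines of Python: an object merges recursively, null deletes a key, and any other value replaces. This is an illustration of the semantics under those rules, not the API server's implementation (which additionally has the strategic variant):

```python
def json_merge_patch(target, patch):
    """RFC 7386 JSON Merge Patch: dicts merge recursively,
    None (JSON null) deletes a key, any other value replaces."""
    if not isinstance(patch, dict):
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null removes the key
        else:
            result[key] = json_merge_patch(result.get(key), value)
    return result

node = {"metadata": {"labels": {"a": "1", "b": "2"}},
        "spec": {"unschedulable": False}}
patched = json_merge_patch(node, {"spec": {"unschedulable": True},
                                  "metadata": {"labels": {"b": None}}})
print(patched)
# {'metadata': {'labels': {'a': '1'}}, 'spec': {'unschedulable': True}}
```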

API path rules

Core API group

# Format
/api/v1/<resource-type>                        # cluster-scoped
/api/v1/namespaces/<namespace>/<resource-type> # namespaced

# Examples
/api/v1/nodes
/api/v1/nodes/node-name
/api/v1/nodes/node-name/status
/api/v1/namespaces/default/pods
/api/v1/namespaces/default/pods/pod-name
/api/v1/namespaces/default/services

命名 API 组 (Named API Groups)

# 格式
/apis/<group>/<version>/<resource-type>
/apis/<group>/<version>/namespaces/<ns>/<resource-type>

# 常用 API 组示例
/apis/apps/v1/deployments                           # Deployment
/apis/apps/v1/namespaces/default/deployments
/apis/batch/v1/cronjobs                             # CronJob
/apis/networking.k8s.io/v1/ingresses                # Ingress
/apis/rbac.authorization.k8s.io/v1/clusterroles     # ClusterRole
/apis/storage.k8s.io/v1/storageclasses              # StorageClass

子资源 (Subresources)

# 状态子资源
/api/v1/nodes/<name>/status
/apis/apps/v1/namespaces/<ns>/deployments/<name>/status

# 日志
/api/v1/namespaces/<ns>/pods/<name>/log
/api/v1/namespaces/<ns>/pods/<name>/log?container=container-name

# 执行命令
/api/v1/namespaces/<ns>/pods/<name>/exec

# 端口转发
/api/v1/namespaces/<ns>/pods/<name>/portforward

# 代理
/api/v1/nodes/<name>/proxy
/api/v1/namespaces/<ns>/pods/<name>/proxy
/api/v1/namespaces/<ns>/services/<name>/proxy
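这些路径还支持查询参数(labelSelector、fieldSelector、limit 等)做服务端过滤。下面用一个小例子演示如何拼接带参数的 --raw URL,其中 build_raw_url 是为说明虚构的辅助函数:

```shell
# 为说明虚构的辅助函数:把查询参数拼接到 --raw 路径上
build_raw_url() {
  local path="$1"; shift
  local sep="?" url="$path" p
  for p in "$@"; do
    url="${url}${sep}${p}"
    sep="&"
  done
  printf '%s\n' "$url"
}

# 按标签过滤并限制返回条数(%3D 是 "=" 的 URL 编码)
URL=$(build_raw_url /api/v1/namespaces/default/pods "labelSelector=app%3Dnginx" "limit=5")
echo "$URL"
# → /api/v1/namespaces/default/pods?labelSelector=app%3Dnginx&limit=5

# 实际调用(需要集群访问权限):
# kubectl get --raw "$URL" | jq '.items[].metadata.name'
```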

特殊端点

查看 API 资源

# 列出所有 API 版本
kubectl get --raw /apis | jq '.groups[].name'

# 查看特定 API 组
kubectl get --raw /apis/apps/v1 | jq .

# 列出所有可用资源
kubectl get --raw /api/v1 | jq '.resources[].name'

# OpenAPI 规范
kubectl get --raw /openapi/v2 | jq . > openapi.json

集群信息

# 版本信息
kubectl get --raw /version | jq .

# 健康检查
kubectl get --raw /healthz
kubectl get --raw /livez
kubectl get --raw /readyz

# API Server 标志
kubectl get --raw /debug/flags/v

# Metrics
kubectl get --raw /metrics

认证和授权

# 检查当前用户权限(SelfSubjectAccessReview 只能通过 POST 创建,
# kubectl get --raw 不支持 -X/-d,应改用 kubectl create --raw)
cat > sar.json <<EOF
{
  "apiVersion": "authorization.k8s.io/v1",
  "kind": "SelfSubjectAccessReview",
  "spec": {
    "resourceAttributes": {
      "namespace": "default",
      "verb": "get",
      "resource": "pods"
    }
  }
}
EOF

kubectl create --raw /apis/authorization.k8s.io/v1/selfsubjectaccessreviews -f sar.json | jq .status

使用 kubectl proxy 的方式

当 kubectl 的 --raw 不可用或语法复杂时,可以使用 proxy 方式:

# 启动代理
kubectl proxy --port=8001 &

# 使用 curl 访问
curl http://localhost:8001/api/v1/nodes | jq .

# GET 请求
curl http://localhost:8001/api/v1/namespaces/default/pods

# POST 请求
curl -X POST \
  -H "Content-Type: application/json" \
  -d @pod.json \
  http://localhost:8001/api/v1/namespaces/default/pods

# PUT 请求
curl -X PUT \
  -H "Content-Type: application/json" \
  -d @node.json \
  http://localhost:8001/api/v1/nodes/node-name/status

# DELETE 请求
curl -X DELETE \
  http://localhost:8001/api/v1/namespaces/default/pods/pod-name

# 停止代理
pkill -f "kubectl proxy"

实战案例

案例 1: 绕过客户端处理修改节点标签

# 问题:kubectl 客户端的校验/默认值处理干扰标签修改
# 解决:直接通过 API 修改
# 注意:--raw 请求仍会经过 API Server 的准入控制(包括 Mutating Webhook),
# 并不能绕过服务端 Webhook

# 1. 获取节点
kubectl get --raw /api/v1/nodes/node-name > node.json

# 2. 使用 jq 添加标签
jq '.metadata.labels["custom-label"] = "custom-value"' node.json > node-updated.json

# 3. 替换节点
kubectl replace --raw /api/v1/nodes/node-name -f node-updated.json

案例 2: 清理僵尸 Finalizer

# 问题:资源因 finalizer 无法删除
# 解决:直接清空 finalizers

# 1. 获取资源
kubectl get --raw /api/v1/namespaces/stuck-namespace > ns.json

# 2. 清空 finalizers
jq '.spec.finalizers = []' ns.json > ns-clean.json

# 3. 更新
kubectl replace --raw /api/v1/namespaces/stuck-namespace/finalize -f ns-clean.json

案例 3: 批量查询资源状态

#!/bin/bash
# 批量检查节点状态

for node in $(kubectl get nodes -o name | cut -d/ -f2); do
  echo "=== Node: $node ==="
  kubectl get --raw /api/v1/nodes/$node/status | \
    jq -r '.status.conditions[] | select(.type=="Ready") | 
    "Status: \(.status), Reason: \(.reason)"'
done

案例 4: 性能测试

#!/bin/bash
# 测试 API Server 响应时间

echo "Testing API Server performance..."
for i in {1..10}; do
  time kubectl get --raw /api/v1/nodes > /dev/null 2>&1
done

案例 5: 导出所有资源

#!/bin/bash
# 导出命名空间的所有资源

NAMESPACE="default"
OUTPUT_DIR="./k8s-backup"
mkdir -p $OUTPUT_DIR

# 导出 Pods
kubectl get --raw /api/v1/namespaces/$NAMESPACE/pods | \
  jq . > $OUTPUT_DIR/pods.json

# 导出 Services
kubectl get --raw /api/v1/namespaces/$NAMESPACE/services | \
  jq . > $OUTPUT_DIR/services.json

# 导出 Deployments
kubectl get --raw /apis/apps/v1/namespaces/$NAMESPACE/deployments | \
  jq . > $OUTPUT_DIR/deployments.json

echo "Backup completed in $OUTPUT_DIR"

注意事项

1. 权限要求

# 需要相应的 RBAC 权限
# 检查权限
kubectl auth can-i get nodes
kubectl auth can-i update nodes

2. resourceVersion 冲突

# 更新时可能遇到冲突
# Error: the object has been modified; please apply your changes to the latest version

# 解决:重新获取最新版本
kubectl get --raw /api/v1/nodes/node-name > node-latest.json
# 重新修改并更新
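"重新获取-修改-重试"的流程可以包装成一个简单的重试框架。以下只是示意 sketch,retry_update 为虚构的函数名,实际使用时把传入的命令换成 get → 修改 → replace 的完整流程:

```shell
# 示意 sketch:resourceVersion 冲突时重新获取再重试
# retry_update 为虚构的函数名,最多重试 3 次
retry_update() {
  local max=3 i
  for i in $(seq 1 "$max"); do
    if "$@"; then
      echo "success on attempt $i"
      return 0
    fi
    echo "conflict, re-fetching latest version (attempt $i)" >&2
    # 冲突后重新拉取最新对象,在最新版本上重做修改:
    # kubectl get --raw /api/v1/nodes/node-name > node-latest.json
  done
  return 1
}
```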

3. 数据格式验证

# 使用 jq 验证 JSON 格式
cat resource.json | jq . > /dev/null

# 如果有错误会提示

4. 备份重要资源

# 在修改前务必备份
kubectl get --raw /api/v1/nodes/node-name > node-backup-$(date +%Y%m%d).json

5. 只读操作优先

# 先用 GET 查看,确认无误后再 PUT/PATCH
kubectl get --raw /api/v1/nodes/node-name | jq .

版本兼容性

Kubernetes 1.18+

kubectl get --raw /api/v1/nodes
kubectl create --raw /api/v1/namespaces/default/pods -f pod.json
kubectl replace --raw /api/v1/nodes/node-name -f node.json
kubectl delete --raw /api/v1/namespaces/default/pods/pod-name
# PATCH 没有对应的 --raw 形式,需使用 kubectl proxy + curl

早期版本或不支持时

# 使用 kubectl proxy
kubectl proxy --port=8001 &
curl http://localhost:8001/api/v1/nodes

调试技巧

1. 查看完整请求

# 增加日志级别
kubectl get --raw /api/v1/nodes -v=8

2. 使用 jq 过滤输出

# 只查看节点名称
kubectl get --raw /api/v1/nodes | jq '.items[].metadata.name'

# 查看 Pod 状态
kubectl get --raw /api/v1/namespaces/default/pods | \
  jq '.items[] | {name: .metadata.name, status: .status.phase}'

3. 格式化时间戳

# 转换时间格式
kubectl get --raw /api/v1/nodes/node-name | \
  jq '.metadata.creationTimestamp | fromdate | strftime("%Y-%m-%d %H:%M:%S")'

总结

kubectl --raw 是 Kubernetes 的"瑞士军刀",提供了:

直接访问 API - 绕过客户端限制
调试工具 - 排查 kubectl 和 API Server 问题
应急修复 - 处理 Webhook 和控制器导致的问题
性能测试 - 直接测试 API Server
学习工具 - 理解 Kubernetes API 结构

⚠️ 使用场景: 作为最后的调试和修复手段
⚠️ 不推荐: 日常操作应使用标准 kubectl 命令
⚠️ 需谨慎: 直接操作可能破坏资源状态


磁盘使用分析工具对比:du vs ncdu vs gdu vs dust

作者 tl.s
2025年11月3日 23:14
Claude Sonnet 4.5 协助编写。

在日常的系统管理和磁盘空间清理工作中,我们经常需要分析磁盘使用情况。本文将对比四个常用的磁盘使用分析工具:传统的 du、经典的交互式工具 ncdu,以及现代化的 gdu 和 dust。

工具简介

du (Disk Usage)

du 是 Unix/Linux 系统自带的经典磁盘使用分析工具,已经存在了几十年。它是最基础、最通用的选择。

ncdu (NCurses Disk Usage)

ncdu 是基于 ncurses 库的磁盘使用分析工具,提供了简洁的交互式文本界面。它是最早流行的交互式磁盘分析工具之一。

gdu (Go Disk Usage)

gdu 是用 Go 语言编写的现代化磁盘分析工具,提供了交互式界面和更快的扫描速度。

dust (du + rust = dust)

dust 是用 Rust 编写的磁盘使用分析工具,以更直观的可视化输出为特色。

功能对比

特性         | du   | ncdu | gdu  | dust
-------------|------|------|------|------
交互式界面   | ❌   | ✅   | ✅   | ❌
扫描速度     | 慢   | 中等 | 快   | 中等
可视化输出   | 基础 | 中等 | 强大 | 优秀
系统自带     | ✅   | ❌   | ❌   | ❌
内存占用     | 低   | 低   | 中   | 中
删除文件功能 | ❌   | ✅   | ✅   | ❌
编程语言     | C    | C    | Go   | Rust
易用性       | 中   | 高   | 高   | 高

使用示例

du 基本用法

# 显示当前目录大小
du -sh

# 显示所有子目录大小并排序
du -h --max-depth=1 | sort -hr

# 显示最大的10个目录
du -h | sort -rh | head -10

优点:

  • 系统自带,无需安装
  • 稳定可靠,脚本友好
  • 广泛的兼容性

缺点:

  • 速度较慢
  • 输出不够直观
  • 缺少交互功能

ncdu 基本用法

# 安装
# Debian/Ubuntu
sudo apt install ncdu

# macOS
brew install ncdu

# RHEL/CentOS
sudo yum install ncdu

# 分析当前目录
ncdu

# 分析指定目录
ncdu /path/to/directory

# 扫描时排除某些目录
ncdu --exclude /path/to/exclude

# 导出结果到文件(可在其他机器上查看)
ncdu -o result.json
ncdu -f result.json  # 读取导出的文件

交互式操作:

  • ↑↓ 或 j/k: 上下移动
  • Enter 或 →: 进入目录
  • ← 或 h: 返回上级目录
  • d: 删除选中的文件/目录
  • g: 显示百分比/图形条
  • n: 按名称排序
  • s: 按大小排序
  • q: 退出

优点:

  • 成熟稳定,广泛使用
  • 交互式界面简洁清晰
  • 可以直接删除文件
  • 支持导出和导入扫描结果
  • 内存占用合理
  • 在大多数发行版仓库中可用

缺点:

  • 扫描速度比 gdu 慢
  • 界面相对传统,不如 gdu 美观
  • 大型目录扫描时需要等待

gdu 基本用法

# 安装
# macOS
brew install gdu

# Linux
curl -L https://github.com/dundee/gdu/releases/latest/download/gdu_linux_amd64.tgz | tar xz
sudo mv gdu_linux_amd64 /usr/local/bin/gdu

# 分析当前目录
gdu

# 分析指定目录
gdu /path/to/directory

# 非交互模式
gdu -n /path/to/directory

优点:

  • 扫描速度极快
  • 交互式 TUI 界面,可以用键盘导航
  • 可以直接在界面中删除文件
  • 支持彩色输出
  • 可以显示进度条

缺点:

  • 需要单独安装
  • 交互模式在某些脚本场景下不适用

dust 基本用法

# 安装
# macOS
brew install dust

# Linux
cargo install du-dust

# 基本使用
dust

# 分析指定目录
dust /path/to/directory

# 显示更多层级
dust -d 3

# 只显示目录
dust -D

优点:

  • 树状图可视化输出,非常直观
  • 彩色条形图显示占用比例
  • 输出清晰易读
  • 速度较快
  • 默认排序输出

缺点:

  • 需要单独安装
  • 没有交互式界面
  • 相对 du 功能较新,可能有兼容性问题

实际使用场景推荐

选择 du 的场景

  • 在生产服务器上进行快速检查
  • 编写自动化脚本
  • 需要最大兼容性
  • 系统资源受限

选择 ncdu 的场景

  • 需要交互式浏览但服务器上没有 gdu
  • 偏好传统稳定的工具
  • 需要导出扫描结果到其他机器分析
  • 在资源受限的系统上需要交互功能
  • 系统包管理器中已有 ncdu

选择 gdu 的场景

  • 需要深入分析大型目录结构
  • 需要最快的扫描速度
  • 追求现代化的交互体验
  • 在个人工作站上使用
  • 经常处理超大目录

选择 dust 的场景

  • 需要快速浏览目录大小
  • 偏好可视化输出
  • 想要更现代化的工具体验
  • 不需要交互式操作
  • 需要快速生成报告

性能对比

在一个包含 50GB 数据、约 100,000 个文件的目录上测试:

  • du: ~8 秒
  • ncdu: ~5 秒(扫描阶段)
  • gdu: ~2 秒
  • dust: ~4 秒

注:实际性能取决于硬件配置、文件系统类型和文件数量。ncdu 的优势在于扫描后的交互浏览非常流畅。
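上面的数字可以在自己的机器上粗略复现。下面是一个简单的计时 sketch,bench_scan 为本文虚构的函数名,未安装的工具会被自动跳过;结果受文件系统缓存影响,建议多跑几次取平均:

```shell
# 粗测各工具扫描同一目录的耗时(bench_scan 为虚构的函数名)
bench_scan() {
  local dir="${1:-.}" tool cmd
  for tool in "du -sh" "ncdu -0 -o /dev/null" "gdu -npc" "dust -d 1"; do
    cmd=${tool%% *}
    command -v "$cmd" >/dev/null 2>&1 || continue   # 未安装则跳过
    echo "== $tool =="
    # time 的输出在 stderr,这里合并后只保留 real 一行
    { time $tool "$dir" >/dev/null; } 2>&1 | grep -E '^real' || true
  done
}

bench_scan .
```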

总结

四个工具各有千秋:

  • du 是经典之选,适合脚本和生产环境
  • ncdu 是稳定可靠的交互式工具,兼具易用性和可用性
  • gdu 是性能之王,提供强大的交互功能和最快速度
  • dust 是可视化专家,输出最为直观

工具演进历史

这四个工具代表了磁盘分析工具的演进过程:

  1. du (1970s): 命令行时代的基础工具
  2. ncdu (2007): 加入交互式界面,提升用户体验
  3. gdu (2020): 现代编程语言带来的性能提升
  4. dust (2018): 注重可视化和用户友好度

我的推荐

对于日常使用,我的建议是:

  1. 保留 du 用于脚本和快速检查
  2. 安装 ncdu 作为通用的交互式工具(服务器友好)
  3. 安装 gdu 用于深入的磁盘分析(个人工作站)
  4. 安装 dust 用于快速浏览和可视化

如果只能选一个额外工具:

  • 服务器环境: 选 ncdu(稳定、轻量、可靠)
  • 个人电脑: 选 gdu(快速、现代、强大)
  • 快速查看: 选 dust(直观、美观、高效)

根据具体需求选择合适的工具,甚至可以将它们组合使用,发挥各自的优势。
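作为"组合使用"的一个思路,可以写一个按优先级回退的小包装:有 gdu 用 gdu,没有则退回 ncdu,再退回系统自带的 du。dubest 是为演示虚构的函数名,仅作 sketch:

```shell
# 按 gdu → ncdu → du 的优先级选用可用工具(dubest 为虚构的函数名)
dubest() {
  local dir="${1:-.}"
  if command -v gdu >/dev/null 2>&1; then
    gdu "$dir"                  # 最快,交互式
  elif command -v ncdu >/dev/null 2>&1; then
    ncdu "$dir"                 # 服务器上更常见
  else
    # 兜底:du + sort,列出最大的 15 个子目录
    du -h --max-depth=1 "$dir" 2>/dev/null | sort -hr | head -15
  fi
}
```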


2025 年黑五云主机活动汇总(含双十一)

作者 tl.s
2025年11月3日 23:14

2025 年主机厂商黑五活动集合,持续更新。

2025 年黑色星期五为 11月28日

内容收集自 Telegram、NodeSeek、LowEndTalk 等途径,仅做信息汇集,不做购买建议。
链接默认无 aff

ChangeLog

  • 2025-11-24

    • TapHip
    • YINNET
  • 2025-11-07

    • ColoCrossing
    • ISIF
  • 2025-11-05

    • DMIT
    • YINNET
    • YXVM
    • CloudIPLC
  • 2025-11-04

    • ZgoCloud
    • DogYun
    • OVH
    • netcup
  • 2025-11-03

    • HostDare

云主机

HostDare

2025-11-03

使用优惠码 BF2025,即可享受洛杉矶 CN2 GIA - CAMD/CSSD 套餐 35% 持续折扣

探索CSSD NVMe VPS方案:https://bill.hostdare.com/store/premium-china-optimized-nvme-kvm

探索CAMD VPS套餐:https://bill.hostdare.com/store/cheap-amd-nvme-kvm-vps-la-usa


使用优惠码 BFBIRD25 即可享受 日本JSSD/NKVM 套餐 25% 持续折扣

探索JSSD NVMe VPS套餐:https://bill.hostdare.com/store/premium-japan-kvm-vps

探索NKVM NVMe VPS套餐请点击:https://bill.hostdare.com/store/cheap-nvme-kvm-vps-japan


使用优惠码 BF2025SALE 即可享受 65% 持续折扣,订阅 经济型洛杉矶 NVMe/HDD 方案

探索 NVMe SSD VPS 方案:https://bill.hostdare.com/store/los-angeles-ssd-kvm

探索AMD VPS套餐:https://bill.hostdare.com/store/cheap-amd-nvme-kvm-vps-la-usa


使用优惠码 LOYAL25 即可享受 保加利亚 NVMe SSD 套餐 25% 持续折扣

探索保加利亚 NVMe 套餐:https://bill.hostdare.com/store/bg-ssd-kvm

通过DeepL.com(免费版)翻译


优惠码 BF2025SALE 可享 65% 折扣,并获赠 免费内存及带宽升级

ASSD0
1 vCPU Core ( AMD EPYC )
10 GB NVMe SSD Storage
768 MB RAM+768 MB RAM*
500 GB + 500 GB Bandwidth/month*
200 Mbps public port
1 IPv4, /64 IPv6 Address
KVM Virtualization
$9.80/year
Order Now

ASSD1
1 vCPU Core ( AMD EPYC )
25 GB NVMe SSD Storage
1 GB RAM+1GB RAM*
1000 GB +1000 GB Bandwidth/month*
500 Mbps public port
1 IPv4, /64 IPv6 Address
KVM Virtualization
$14.70/year
Order Now

ASSD2
2 vCPU Cores ( AMD EPYC )
50 GB NVMe SSD Storage
2 GB RAM+ 2GB RAM*
2000 GB+2000GB Bandwidth/month*
500 Mbps public port
1 IPv4, /64 IPv6 Address
KVM Virtualization
$26.25/year
Order Now


优惠码 BF2025 可享 35% 折扣,并免费升级内存及带宽/端口。

CSSD0
1 vCPU Core
10 GB SSD NVMe Storage
768 MB + 768 MB RAM*
250 GB +250 GB Bandwidth/month*
100 Mbps CN2,GIA,CU,CM Network*
1 dedicated IPv4
Price: $23.39/year
Order Now

CSSD1
1 vCPU Cores
25 GB SSD NVMe Storage
1 GB + 1 GB RAM*
600 GB +600 GB Bandwidth/month*
100 Mbps CN2,GIA,CU,CM Network*
1 dedicated IPv4
Price: $33.59/year
Order Now

CSSD2
2 vCPU Cores
50 GB SSD NVMe Storage
2 GB +2 GB RAM*
1000 GB +1000 GB Bandwidth/month*
100 Mbps CN2,GIA,CU,CM Network*
1 dedicated IPv4
Price: $51.59/year
Order Now


促销条款与细则

凡订购任何ASSD/SSD/CSSD/CAMD虚拟专用服务器并选择年度付款方案,即可获赠双倍内存及带宽/端口升级。本优惠仅适用于年度或更长期限付款方案。

  • 标注项为免费升级。所有折扣优惠券仅限3期付款方案使用。

所有非托管VPS服务均包含:

全天候高效技术支持
即时部署服务
自定义反向DNS
VNC远程控制台
企业级服务器硬件技术
Virtualizor控制面板(支持启动/重启/关机/重装)
高端带宽
自有硬件及IP地址(运营自2016年)
99.9%运行时间保障
更多服务!

数据中心测试IP及下载文件: Test File

netcup

2025-11-04

限时抢购提醒 ⚡

纽伦堡地区限时特惠:RS 1000 G12 Pro 套餐,SSD 容量翻倍,最低仅需签约3个月!

🔥 抓紧时间——优惠仅限今日(2025-11-04)下午5点前有效!

👉  https://sohub.io/js3x

OVH

2025-11-04

🛍 OVH KS-LE sale is now live on the website:

https://eco.ovhcloud.com/en/

🖥️ OVH 服务器查询: https://kele.ge/ovh.html

2025-11-19

OVH黑五页面正式上线
https://www.ovhcloud.com/en-ie/black-friday/

美区OVH促销上线
https://us.ovhcloud.com/deals/

狗云 | DogYun

2025-11-04

双11 · 促销活动

活动一:
折扣码“11.11”:新开弹性云7折;新开经典云(特价机除外)8折。
折扣码“jian100”:新开物理服务器优惠100元。

活动二:
11月5日-11月11日,单笔充值每满11元送1元。

活动三:
11月5日-11月11日,幸运大转盘每日抽取5折码,流量,余额等奖品。

活动四:
两款物理服务器预售,11月4日开始逐步交付
预售地址:https://ds.dogyun.com/server/preorder

ZgoCloud

2025-11-04

ZgoCloud 2025双十一活动

一:洛杉矶非优化线路,解锁还行

1C / 1G / 20G SSD / 2T月流量 / 15刀年付
https://clients.zgovps.com/?cmd=cart&action=add&id=93

二:洛杉矶国内优化线路,9929 + CMIN2线路

1C / 768MB / 15G SSD / 600G月流量 / 30刀年付
https://clients.zgovps.com/?cmd=cart&action=add&id=39

支持3天内流量不超过10G,全额退款(套餐一不支持以网络原因退款)

DMIT

2025-11-05

DMIT促销款补货

联通移动的建议上DMIT.LAX.EB ,电信上DMIT.LAX.Pro

DMIT.LAX.Pro 马里布 补货
MALIBU 1TB@1Gbps=49.9USD/yr
https://www.dmit.io/aff.php?pid=186
三网 CTG GIA(aka CN2 GIA) 回程
电信 CTG GIA 去程
联通 CUG VIP(aka 9929) 去程
移动 CMIN2 去程
另IPv6双程三网优化已是最佳路由

DMIT.LAX.EB WEE 补货
WEE 1TB@1Gbps=39.9USD/yr
https://www.dmit.io/aff.php?pid=188
联通 CUG VIP(Aka 9929) 去程 / 回程
电信 CN2 GIA 去程 CUG VIP 回程
移动 CMIN2 去程 / 回程
另IPv6双程三网优化已是最佳路由

YINNET

2025-11-05

📢 YINNET 限時優惠|六折促銷代碼限量 300 個

尊敬的用戶您好,

為回饋新老客戶的支持,YINNET 現推出限時促銷活動:

✅ 全場 VPS 產品 6 折優惠
✅ 限量 300 組優惠代碼:E7F1RHKT1I
✅ 每位用戶、每張訂單限使用一次
✅ 售完即止,先用先得

👉 登入官網立即使用:
https://www.yin-net.com/aff.php?aff=27&gocar=true
若您近期有伺服器、跨境應用部署、遊戲伺服器、跨國網站、加速業務等場景需求,這將是最划算的入手機會。

如有產品選型或技術疑問,歡迎隨時聯繫我們的客服團隊,我們將竭誠協助。

感謝您一直以來的信任與支持!
YINNET 團隊

YXVM

2025-11-05

🎉 YXVM · 双十一活动预热 🎉
🚀 新品上架 · 超值独立服务器 即刻开售!

双十一即将到来啦!YXVM 为您带来 性价比爆表的日本独立服务器新品
更稳网络、更便宜价格,一站满足建站、中转、数据库等多种需求!
采用超微 MS5039 ,所有服务器均支持 IPMI


✅ 新品亮点

  • CPU:E3-1230v6
  • RAM: DDR4 UDIMM ECC 16G
  • 存储:250G SSD
  • 带宽:1Gbps 独享
  • 流量:无限制(公平使用)
  • IP:/29 IPv4 段(5个可用IP)
  • 测试IP:142.248.148.2
  • 价格:月付 $59 USD / 年付 $499 USD

👉 立即订购

开通日期 11.11
请注意这款产品有14个C段,IP地址为随机分配/29,您拿到的机器与测试IP可能不在一个C段。
我们不支持更换指定地址请求。

CloudIPLC

2025-11-05

✨【CloudIPLC-双11秒杀】

香港CMI-Mini(三网回国优化),限时双倍升级+限时折扣!
限时折扣:仅需299元/年!

限时升级:
vCPU:1H->2H
内存:1GB->2GB
存储:10G->20G SSD
流量:300G->600G
带宽:30M->60M
活动时间:即日起-11/11

👉抢购链接(每ID限购1个):
https://www.cloudiplc.com/cart.php?a=add&pid=157

请注意!此款是个人轻量使用,流量固定,不可变更!


2025-11-07

【CloudIPLC 双11+9周年庆】

🔥 沪日IPLC-NAT秒杀款,限时返场!

返场时间:11/7-11/9 【数量有限,售完截止】
配置:1vCPU/512M/5G-SSD/50M峰值带宽
流量&价格(下单后流量不可变更):
200G方案:779/半年,1299/整年
1000G方案:4559/半年,7599/整年

👉🏻 抢购链接:
https://www.cloudiplc.com/aff.php?pid=125

🌍 线路说明:
上海入口采用更优秀的BGP多线,确保不同运营商的低延时快速访问。东京出口为NTT+BBIX
⚠️ 由于涉及内地入口,务必合法合规。严禁高风险应用(例如 XUI 面板),不建议新手购买

ISIF

2025-11-07

🌟 JP.TYO-Cs.1C0G-CM
1 vCPU (EPYC) · 512MB RAM · 20GB SSD
📶 300Mbps · 流量 1000GB(大陆优化)
🌐 IPv4 ×1(大陆优化)
🌐 IPv6 /56(国际)
💰 双十一特惠:年付 €48.00 🎁

🌟 JP.TYO-Cs.1C1G-CM
1 vCPU (EPYC) · 1GB RAM · 40GB SSD
📶 1000Mbps · 流量 2000GB(大陆优化)
🌐 IPv4 ×1(大陆优化)
🌐 IPv6 /56(国际)
💰 双十一特惠:年付 €84.00 🎁

🌟 HK.HKG-Cs.1C0G-CM
1 vCPU (EPYC) · 512MB RAM · 20GB SSD
📶 300Mbps · 流量 1000GB(大陆优化)
🌐 IPv4 ×1(大陆优化)
🌐 IPv6 /56(国际)
💰 双十一特惠:年付 €48.00 🎁

🌟 HK.HKG-Cs.1C1G-CM
1 vCPU (EPYC) · 1GB RAM · 40GB SSD
📶 1000Mbps · 流量 2000GB(大陆优化)
🌐 IPv4 ×1(大陆优化)
🌐 IPv6 /56(国际)
💰 双十一特惠:年付 €84.00 🎁

Test IP:
日本 46.3.43.1
香港 46.3.39.1

🔗 立即订购 👉 https://cloud.isif.net/order/ecs - Special Products
也有常规月付款可供选购

ColoCrossing

2025-11-07

11.11 VPS offers, now available in multiple locations - New York, Los Angeles, Dallas, Chicago, Toronto

1GB KVM VPS
1GB RAM
1vCPU
30GB SSD
40TB @ 1Gbps 
1 x IPv4
$11.11/yr
https://cloud.colocrossing.com/aff.php?pid=112

2GB KVM VPS
2GB RAM
2vCPU
40GB SSD
40TB @ 1Gbps 
1 x IPv4
$16.16/yr
https://cloud.colocrossing.com/aff.php?pid=114

4GB KVM VPS
4GB RAM
3vCPU
60GB SSD
40TB @ 1Gbps 
1 x IPv4
$22.22/yr
https://cloud.colocrossing.com/aff.php?pid=126

8GB KVM VPS
8GB RAM
4vCPU
80GB SSD
40TB @ 1Gbps 
1 x IPv4
$44.44/yr
https://cloud.colocrossing.com/aff.php?pid=116
2025-11-18

https://www.colocrossing.com/black-friday

Refs: https://lowendtalk.com/discussion/211721/colocrossings-black-friday-sale-vps-from-1-97-mo-dedicated-servers-from-147-96-yr-more

RackNerd

2025-11-18

2025黑色星期五VPS促销套餐来喽!


========
1GB 套餐
========
1 GB 内存
1 CPU 核心
25 GB SSD 存储
2000 GB 月流量
1 Gbps 带宽
1个 IPv4 地址
$10.60 /年 (续费同价)
可选机房: 多机房可选
购买链接: https://my.racknerd.com/cart.php?a=add&pid=923

========
2.5GB 套餐
========
2.5 GB 内存
2 CPU 核心
45 GB SSD 存储
3000 GB 月流量
1 Gbps 带宽
1个 IPv4 地址
$18.66 /年 (续费同价)
可选机房: 多机房可选
购买链接: https://my.racknerd.com/cart.php?a=add&pid=924

========
4GB 套餐
========
4 GB 内存
3 CPU 核心
65 GB SSD 存储
6500 GB 月流量
1 Gbps 带宽
1个 IPv4 地址
$29.98 /年 (续费同价)
可选机房: 多机房可选
购买链接: https://my.racknerd.com/cart.php?a=add&pid=925

========
6GB 套餐
========
6 GB 内存
5 CPU 核心
100 GB SSD 存储
10,000 GB 月流量
1 Gbps 带宽
1个 IPv4 地址
$44.98 /年 (续费同价)
可选机房: 多机房可选
购买链接: https://my.racknerd.com/cart.php?a=add&pid=926

========
8GB 套餐
========
8 GB 内存
6 CPU 核心
150 GB SSD 存储
20,000 GB 月流量
1 Gbps 带宽
1个 IPv4 地址
$62.49 /年 (续费同价)
可选机房: 多机房可选
购买链接: https://my.racknerd.com/cart.php?a=add&pid=927

Lycheen

2025-11-20
🔥🔥 Lycheen 荔枝云 | 黑五促销

全年最低价,提供多地区 DNS 解锁,可随时订购流量包。

黑五特价产品
🇭🇰 HKv2 Lite Nano (移动快乐机):
1C / 512M / 5G / 768G流量 
价格:¥7.50/月 
🛒 [购买链接] (https://www.lycheen.com/store/2025-black-friday/hkv2-lite-nano-bf25)

🇺🇸 US-SLC Lite Tall (超级性价比):
1C / 1G / 10G / 2T流量 
价格:🔥 ¥88/年 
🛒 [购买链接] (https://www.lycheen.com/store/2025-black-friday/us-slc-lite-v1-1g-bf25) 

🇺🇸 US-SLC Lite Grand (美国大流量):
1C / 2G / 15G / 4T流量 
价格:¥13.90/月 
🛒 [购买链接] (https://www.lycheen.com/store/2025-black-friday/us-slc-lite-v1-grand-bf25)

🇪🇺 欧洲精品 DE/UK (中国优化,限量促销):
1C / 512M / 5G / 300G流量 
价格:¥10.00/月 
🇩🇪 德国 BGP [购买链接] (https://www.lycheen.com/store/2025-black-friday/de-fra-bgp-pro-mini-bf25)
🇬🇧 英国 BGP [购买链接] (https://www.lycheen.com/store/2025-black-friday/uk-lon-bgp-pro-mini-bf25)

---

📂 活动主会场:
https://www.lycheen.com/store/2025-black-friday/

🏷 全场折扣:
常规产品(不含促销/VDS/WARP)循环 7 折
优惠码: BF2025_30OFF
官网:  https://www.lycheen.com/store/

活动时间:即日起 到11月29日
支持加购 IP、建立 BGP Session、BYOIP/ASN 广播。
如有需求请提交工单咨询。

TakeHost

https://takehost.biz/manage/order/yearly-vps/1gb-yearly

✓ 1x CPU
✓ AMD EPYC 7763
✓ 1 GB RAM
✓ 50 GB NVMe SSD storage
✓ 50GBPS Port

7.90 euro a year
35 in stock

CrownCloud

CrownCloud 荷兰/德国 EPYC主机优惠,老牌商家

Plan1:
- 2v CPU(EPYC 7763)
- 6G RAM
- 60G SSD
- 3TB/10Gbps
- 1x Snapshot Slot
$5/月,$50/年
荷兰 (https://crowncloud.net/clients/index.php?rp=/store/specials/ams-6g-60g-epyc-10g-1s) / 德国 (https://crowncloud.net/clients/index.php?rp=/store/specials/fra-6g-60g-epyc-10g-1s)

Plan2:
- 4v CPU(EPYC 7763)
- 8G RAM
- 80G SSD
- 4TB/10Gbps
- 1x Snapshot Slot
$7/月,$70/年
荷兰 (https://crowncloud.net/clients/index.php?rp=/store/specials/ams-8g-80g-epyc-10g-1s) / 德国 (https://crowncloud.net/clients/index.php?rp=/store/specials/fra-8g-80g-epyc-10g-1s)

· 支持Windows(需自备许可证)
· 25端口可通过工单申请
· 可开启CPU直通,可切换VirtIO和e1000网卡
· rDNS支持
· 免费自定义ISO

测试IP:
德国:Test IPv4: 195.88.191.1 / Test IPv6: 2a0a:51c0:0:1f:291d::1
荷兰:Test IPv4: 2.58.15.20 / Test IPv6: 2a07:efc0:1001:1265::1 / Test File: https://nl2.lg.crowncloud.net/1gb.file

YINNET

2025-11-23
📢 YINNET 黑色星期五限時活動|全場 VPS 五折狂歡

尊敬的用戶您好,

為迎接黑色星期五購物季,回饋廣大用戶對 YINNET 的長期支持,我們特別推出年度重磅優惠活動:

✅ 全場 VPS 產品 5 折優惠
✅ 黑五專屬促銷代碼:BF50YINNET
✅ 每位用戶、每張訂單限使用一次
✅ 活動限時開放,售完即止,先到先得

⏰ 活動時間:即日起至黑色星期五當晚 24:00 截止

👉 登入官網於結算頁輸入優惠代碼即可享受五折優惠。

如您對產品選型、配置需求或部署方案有任何疑問,歡迎隨時聯繫客服團隊,我們將為您提供專業協助。

感謝您一直以來的支持與信任,
祝您黑五購物愉快!

YINNET 團隊 敬上

官網:https://www.yin-net.com/

TapHip

2025-11-23
黑五促销上线!HiNet 台湾家宽限时特价

📌 NAT 机型(¥99/年)
CPU:2
RAM:512MB
SSD:5GB
网络端口:300Mbps
月流量:512GB
IPv4 NAT & IPv6 × 1

📌 独享 IP(¥188/月)
CPU:4
RAM:2048MB
SSD:15GB
网络端口:500Mbps
月流量:10240GB
IPv4 × 1 + IPv6 × 1
🛒 下单地址:https://taphip.com/aff.php?gid=7

✨ 24H自动切换IP

App

Parallels

Parallels® Desktop for Mac 标准版和 Pro 版一年订阅官方打 5 折:

https://www.parallels.com/products/desktop/

Refs: https://www.appinn.com/parallels-desktop-26/

GoodSync

黑色星期五/网络星期一 促销 新的GoodSync高级订阅可享40%优惠截至12月1日。

https://www.goodsync.com/php/pums/rfprepay.php?lang=cn

UptimeRobot

UptimeRobot 黑五7折优惠
https://app.uptimerobot.com/billing/pricing/?coupon=BF25_30


彻底解决阿里云和 tailscale 冲突

作者 tl.s
2025年10月27日 23:20
如果你在一台阿里云服务器上安装并启动了 tailscale,大概率会出现服务器无法上网的问题,根本原因是阿里云默认的内网 DNS 地址与 tailscale 使用的网段冲突。

阿里云和 tailscale 都使用了 100.64.0.0/10 这个网段。100.64.0.0/10 在 RFC 6598 中被保留为运营商级 NAT (Carrier-Grade NAT) 地址段,用于 ISP 做 NAT 时避免与内网冲突。Tailscale 把它当成"只允许来自 Tailscale 接口的地址段"是符合规范的。但阿里云在 VPC 内把 100.100.2.136 和 100.100.2.138 作为内网 DNS 服务地址,初衷是:地址在公网不可路由,避免外泄;与经典网络互通时不会冲突。

由于动机不同,目标不同,造成二者冲突。

目前比较流行的方法,是关闭 tailscale 的 iptables 规则生成,但这是不安全的。浏览各种解决方案后我认为脚本轮询的方案最可靠。

脚本轮询解决

/usr/local/bin/fix-ts-dns.sh

#!/bin/bash
while true; do
  if ! iptables -C ts-input -s 100.100.2.136/32 -j ACCEPT 2>/dev/null; then
    iptables -I ts-input 1 -s 100.100.2.136/32 -j ACCEPT
  fi
  sleep 30
done

/etc/systemd/system/fix-ts-dns.service

[Unit]
Description=Keep Aliyun DNS whitelist in Tailscale chain
After=tailscaled.service

[Service]
Type=simple
ExecStart=/usr/local/bin/fix-ts-dns.sh
Restart=always

[Install]
WantedBy=multi-user.target

chmod +x /usr/local/bin/fix-ts-dns.sh
systemctl enable --now fix-ts-dns.service

脚本每 30 秒轮询一次:当检测到 tailscale 的 ts-input 链中不存在白名单规则时,自动在链首插入一条放通阿里云内网 DNS 地址 100.100.2.136 的规则。
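阿里云实际提供 100.100.2.136 和 100.100.2.138 两个内网 DNS 地址,上面的脚本只放通了前者。如需同时放通两个地址,可以参考下面的示意写法,ensure_dns_whitelist 为虚构的函数名,逻辑与原脚本一致,需 root 权限执行:

```shell
# 与原脚本逻辑一致,但同时放通两个阿里云内网 DNS 地址
# ensure_dns_whitelist 为虚构的函数名,需 root 权限执行
ensure_dns_whitelist() {
  local ip
  for ip in 100.100.2.136 100.100.2.138; do
    # 规则不存在时才插入,避免重复
    if ! iptables -C ts-input -s "$ip/32" -j ACCEPT 2>/dev/null; then
      iptables -I ts-input 1 -s "$ip/32" -j ACCEPT
    fi
  done
}
```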

解决方案来源:https://www.xugj520.cn/archives/aliyun-cgnat-tailscale-conflict.html


2025-10-21 | 沉淀思维

作者 tl.s
2025年10月21日 22:08

焦虑的时候,还是要看书。

一章一章的读下去,让自己的思维从那些焦虑的事情上短暂移开,看到更大的世界。

焦虑是提示人们处于不安全状态的信号,而焦虑的原因很可能源自过去的事、低概率的事、未知的事。弄清楚这一点就会明白,焦虑只是在传递信号;真正需要做的,是更通透地看待这个世界,更深刻地思考面临的问题,更清晰地审视当前的处境。想清楚、看清楚,思考得足够深、足够远,就不会在意当下这一点小麻烦,焦虑可能也就随之而去。

Reading - Unsplash

试着站在更高维度去思考和做事。多看书,多总结。

2025-10-16 | 负载高低

作者 tl.s
2025年10月16日 23:10

Oracle 云按照负载情况清退免费账户的虚拟机,这个思路感觉很厉害。其他云厂商通常是听说谁负载太高了,就被限制。两种思维方式不同,面对的情况也有很大区别。常规云厂商如果单用户负载过高,影响到整体用户的体验,就必须介入管控,因为机器运行的每一分钟对云厂商都是金钱。但对 Oracle 这种体量的企业来说,虽然有严格的风控和注册限制,规模依然很大,其中难免有一批注册来不用的账户,把这样的机器回收回去也合理,需要保证的是整体负载要上去,不要空置。两种思路,似乎跟现实中许多实体的运转情况高度类似。

Oracle - Unsplash
