
A RISC-V gcc pitfall revealed by a glibc update

By JieJiSS
April 7, 2022, 22:48

UPDATE 2023-05-11: Finally we have built-in inline subword atomic support in GCC!

So we won't be bothered by this issue anymore.


The original blog post:

TL;DR: see "Wrap Up" and "Solution".

Problem

Recently, while working on a series of package rebuild pipelines triggered by a glibc update (from 2.33 to 2.34), we discovered that some previously-compiling packages, mostly configured with CMake, now refuse to compile. Their error logs look like this:

/usr/bin/ld: libbson-1.0.so.0.0.0: undefined reference to `__atomic_exchange_1'
/usr/bin/ld: libbson-1.0.so.0.0.0: undefined reference to `__atomic_compare_exchange_1'
collect2: error: ld returned 1 exit status

This is quite strange to us, because we thought we had gotten rid of this issue for good by manually appending set(THREADS_PREFER_PTHREAD_FLAG ON) to these packages' CMakeLists.txt, in addition to replacing all -lpthread with -pthread. This was done on a per-package basis, adapting each patch to the tech stack of the specific package, and we are sure these patches worked before the glibc upgrade.

Before proceeding, you need to know the difference between -pthread and -lpthread, and that -pthread is the preferred way to link against the pthread library, at least for glibc <= 2.33.

So, what's happening? Did anything vital change in this glibc upgrade?

Cause

After reading the release note of glibc 2.34, we did notice something related:

...all functionality formerly implemented in the libraries libpthread, libdl, libutil, libanl has been integrated into libc. New applications do not need to link with -lpthread, -ldl, -lutil, -lanl anymore. For backwards compatibility, empty static archives libpthread.a, libdl.a, libutil.a, libanl.a are provided, so that the linker options keep working.

Hmm, good, sounds like we don't need -pthread anymore. Actually, it appears CMake thinks the same. After running cmake for the failing packages, we can confirm that nothing resembling -pthread is appended to either CXXFLAGS or LDFLAGS. However, if forced to compile with -pthread, the previously mentioned link error disappears. In other words, applying this patch fixes the error:

- make
+ CFLAGS="-pthread $CFLAGS" CXXFLAGS="-pthread $CXXFLAGS" make

At this point, every clue we have gathered so far seems to indicate a bug inside cmake:

  • Other tools work fine
  • set(THREADS_PREFER_PTHREAD_FLAG ON) appears to be "broken"
  • Manually appending -pthread fixes the issue, so cmake failed to recognize that -pthread is necessary

Root Cause

Blame CMake?

So we dug into CMake's source code to see what had happened. To our surprise, we didn't notice any change to the detection code it uses to determine whether -pthread is necessary. CMake's detection code, at the time of writing, looks like this:

#include <pthread.h>
static void* test_func(void* data) {
  return data;
}
int main(void) {
  pthread_t thread;
  pthread_create(&thread, NULL, test_func, NULL);
  pthread_detach(thread);
  pthread_cancel(thread);
  pthread_join(thread, NULL);
  pthread_atfork(NULL, NULL, NULL);
  pthread_exit(NULL);
  return 0;
}

It turned out that after the glibc upgrade, this detection code compiles without any additional arguments:

$ gcc test_pthread.c
$ echo $?
0

As far as I can tell, this does not look like a CMake bug. The expected behavior, which is exactly what we saw, is that if the test code can be compiled without -pthread, then the argument should not be appended to the command line (or to {C,CXX}FLAGS, correspondingly). This leads to another question: why are those packages failing now, given that the detection code is unchanged and used to work properly?

Dive into Source Code

Let's take the package mongo-c-driver as an example. As of version 1.21.1, we can see the code listed below in its src/libbson/src/bson/bson-atomic.h. (Does this path feel familiar? Remember the libbson-1.0.so.0.0.0: undefined reference error log mentioned before?)

#define DECL_ATOMIC_STDINT(Name, VCSuffix) \
   DECL_ATOMIC_INTEGRAL (Name, Name##_t, VCSuffix)

#define DECL_ATOMIC_INTEGRAL(NamePart, Type, VCIntrinSuffix)        \
   static BSON_INLINE Type bson_atomic_##NamePart##_fetch_add (     \
      Type volatile *a, Type addend, enum bson_memory_order ord)    \
   {                                                                \
      DEF_ATOMIC_OP (BSON_CONCAT (_InterlockedExchangeAdd, VCIntrinSuffix), \
                     __atomic_fetch_add,                            \
                     __sync_fetch_and_add,                          \
                     ord,                                           \
                     a,                                             \
                     addend);                                       \
   } // Omitted many lines...

As we can tell from libbson's source code, it attempts to use GCC's __atomic_* builtins, like __atomic_fetch_add and __atomic_compare_exchange. We can construct some test cases to check whether we're right:

// test1.c
int main() {
  char u = 0, v = 1;
  __atomic_exchange_n(&u, v, __ATOMIC_RELAXED);
}

// test2.c
int main() {
  short u = 0, v = 1;
  __atomic_exchange_n(&u, v, __ATOMIC_RELAXED);
}

// test4.c
int main() {
  int u = 0, v = 1;
  __atomic_exchange_n(&u, v, __ATOMIC_RELAXED);
}

// test8.c
int main() {
  long long u = 0, v = 1;
  __atomic_exchange_n(&u, v, __ATOMIC_RELAXED);
}

Here, we use the gcc built-in __atomic_* because libbson uses it. Replacing it with std::atomic<T> (and compiling with g++) yields identical results. You may also use the uint*_t types provided by the linux/types.h header, as their sizes are guaranteed to be the same across all architectures.

Compile results:

$ gcc test1.c
/usr/bin/ld: /tmp/ccNst0x1.o: in function `.L0 ':
test1.c:(.text+0x34): undefined reference to `__atomic_exchange_1'
collect2: error: ld returned 1 exit status
$ gcc test2.c
/usr/bin/ld: /tmp/ccvK2EbX.o: in function `.L0 ':
test2.c:(.text+0x36): undefined reference to `__atomic_exchange_2'
collect2: error: ld returned 1 exit status
$ gcc test4.c
$ gcc test8.c

If you are somewhat familiar with C++, you may have heard that code using atomic operations might need to link against libatomic (I remember once reading this, but I can't find the source now). It turns out that if we provide -latomic to gcc, the code compiles:

$ gcc test1.c -latomic
$ gcc test2.c -latomic

And this also works for -pthread:

$ gcc test1.c -pthread
$ gcc test2.c -pthread

Sharp-eyed readers may have noticed something unusual:

First, as we can see, gcc compiles test4.c and test8.c successfully without -latomic or -pthread, but it cannot handle the atomic operations in test1.c and test2.c without calling into libatomic:

$ gcc test1.c -latomic -S
$ gcc test4.c -S
$ cat test1.s | grep atomic
call __atomic_exchange_1@plt
$ cat test4.s | grep atomic

This is partially explained in a GitHub issue (riscv-collab/riscv-gcc#12). As of April 2022, i.e. when this blog post was written, gcc does not support inlining subword atomics, and this is why -latomic must be present when invoking gcc.

But what about -pthread? Why does -pthread work too? I mean, libpthread itself does not provide those atomic symbols, right? Actually, if you use ldd to check the a.out generated by gcc test1.c -pthread, you will notice that libatomic is linked, rather than libpthread as we would expect from -pthread. Replacing -pthread with -lpthread doesn't work, so gcc must be doing something internally for -pthread besides linking libpthread. To figure out the difference, we have to check the gcc spec:

$ gcc -dumpspecs | grep pthread
%{pthread:-lpthread} %{shared:-lc} %{!shared:%{profile:-lc_p}%{!profile:-lc}} %{pthread:--push-state --as-needed -latomic --pop-state}

Oh, the answer is finally revealed: gcc silently appends --as-needed -latomic when -pthread is provided. This also explains why changing -lpthread to -pthread works.

The reason why gcc decided to link libatomic when -pthread is provided has been lost in a squashed commit. My guess is that they tried to paper over the subword atomic pitfall with this strategy. (UPDATE: aswaterman's initial fix did not limit the scope to -pthread. The limitation was added one week later.) Generally, atomics are used with multi-threading, and you use pthread in such cases. However, one may use atomics without pthread, for example when writing signal handlers. So IMO this workaround is not good enough, and should be replaced by inline subword atomic support.

UPDATE: We discussed this with some gcc RISC-V devs, and it turned out that when compiling gcc, at stage 1 it may not recognize libatomic (i.e. -latomic may not work) if you are compiling it with a gcc that uses the old spec. So you'll have to set --with-spec at stage 1 to work around this. Hopefully there will be an --always-link-libatomic configuration flag in the future.

Fun fact: gcc once added --as-needed -latomic to its LIB_SPEC unconditionally, in this commit (2017-02-01):

+ #undef LIB_SPEC
+ #define LIB_SPEC GNU_USER_TARGET_LIB_SPEC \
+ " " LD_AS_NEEDED_OPTION " -latomic " LD_NO_AS_NEEDED_OPTION \

But later in the final port commit submitted to gcc (2017-02-07; svn r245224), they limited the change to -pthread:

-   " " LD_AS_NEEDED_OPTION " -latomic " LD_NO_AS_NEEDED_OPTION \
+ " %{pthread:" LD_AS_NEEDED_OPTION " -latomic " LD_NO_AS_NEEDED_OPTION "}"

The reason for this limitation remains unknown.

Wrap Up

glibc moved libpthread into libc, so CMake's pthread sample code compiles directly. Hence, CMake thinks there's no need to provide -pthread when invoking gcc / g++. Since we used to patch such packages with -pthread instead of -latomic, and were actually relying on -pthread's side effect (--as-needed -latomic), neither -pthread nor -latomic is now provided, so subword atomics fail to link because libatomic is missing. For ways of fixing this problem, refer to the Solution chapter.

Also note that the dependence on libatomic causes another problem. Some packages / frameworks, e.g. JUCE, insist that their atomic wrappers only accept lock-free values. However, when a program is compiled, the compiler cannot know whether a call into libatomic will be lock-free or not (in practice, the emitted sequence will be lock-free, but the compiler doesn't know that). Hence, the compiler returns false for std::atomic<bool>::always_lock_free, and the same happens for the macro ATOMIC_BOOL_LOCK_FREE defined in atomic.h. As a result, if you use juce::Atomic<bool>, some static asserts will fail, and the package refuses to compile.

Solution

Solutions vary according to the scale of projects.

  • As a developer / user acting on your own behalf, export CFLAGS="$CFLAGS -Wl,--as-needed -latomic" (and likewise for CXXFLAGS) solves the problem. Alternatively, you can patch the Makefile and edit the flags inside it.
  • As a packager maintaining multiple packages with the same toolchain, you may want to write a module / universal patch. For CMake projects, you can use LLVM's CheckAtomic.cmake, which (in newer versions) takes subword atomics into consideration.
  • As a developer maintaining a relatively large project with an established user base, you may need to consider compatibility with clang and compiler-rt. Clang should be able to handle -latomic, but better double-check. You can temporarily rely on the side effect of gcc -pthread (not recommended), or edit your configure script to disable -latomic when clang and compiler-rt are detected.
  • As a developer of a widely-used toolchain project, besides considering clang and compiler-rt, you may also need compatibility with older compilers / linkers that support neither --as-needed nor even -pthread. CMake reverted their -pthread fix over this, because a compiler named XL does not support -pthread: it interprets it as -p -t hread and returns zero (oops! a false negative) as its exit code.
  • The ultimate fix is to add subword atomic inlining support to gcc. There's a pending patch for this, but it only covers __atomic_fetch_add, so the problem would not be fully mitigated (still, a promising start!).

Complicated solutions increase the burden on project developers / maintainers, and may reduce their willingness to port their projects to RISC-V. To tackle the problem, we have proposed changing LIB_SPEC back to its originally patched form to partially mitigate this issue (still not lock-free), and we hope the subword atomic pitfall can be solved completely in the future.

UPDATE: After discussing this with gcc RISC-V devs, we reached the consensus that maybe a gcc configuration argument --always-link-libatomic can be added.

UPDATE2: This breaks GCC's bootstrap process. GCC compiles itself twice to get rid of, e.g., dirty compilation-environment leftovers. Modifying the spec string too early causes glitches, such as libatomic calling into itself infinitely. So we have to use sed to carefully replace the spec string inside the first-stage GCC binary, padding with blanks (U+0020) to make sure the old and new spec strings have the same length.

Source: https://blog.jiejiss.com/

Setup an Arch Linux RISC-V Development Environment

By JieJiSS
March 22, 2022, 11:04

This is a frequently updated doc; you can find its latest version here.

This documentation aims to help you set up an Arch Linux RISC-V (riscv64gc) development environment powered by usermode QEMU and systemd-nspawn.

Currently, the documentation contains instructions for Arch Linux, Debian and Ubuntu.

Caution: if you are using a native RISC-V board, skip these steps and scroll down to the bottom of this page. The following instructions are for users on x86_64 and similar hosts.

1 Prepare the Container

1.1 Install Packages

If you are using Ubuntu/Debian as the host OS, skip to "For Debian and Ubuntu".

For Arch Linux

Add [archlinuxcn] Repo Source

We need to install some packages from [archlinuxcn] later.

Append these two lines to /etc/pacman.conf:

[archlinuxcn]
Server = https://repo.archlinuxcn.org/$arch

There is also a list of public mirrors available.

After doing so, you need to trust the archlinuxcn maintainers' PGP keys to install packages from it:

$ sudo pacman -Sy && sudo pacman -S archlinuxcn-keyring

Install Packages

$ sudo pacman -S qemu-user-static qemu-user-static-binfmt

where qemu-user-static-binfmt registers the QEMU interpreter for executing RISC-V ELF files. Other necessary packages like zstd and systemd-nspawn are in the dependency tree of the base meta package, so the command above installs them too; there's no need to install them explicitly.

For Debian and Ubuntu

$ sudo apt install zstd qemu-user-static systemd-container

where zstd is for decompressing the Arch Linux RISC-V rootfs compressed tarball, and systemd-container is for the systemd-nspawn command, which we'll use later to spawn a container from the rootfs.

1.2 Download rootfs

$ curl -O https://archriscv.felixc.at/images/archriscv-20220727.tar.zst

If you have poor connectivity upon reaching archriscv.felixc.at, you can also download the rootfs from mirror sites. They are listed at https://archriscv.felixc.at/.

The root user's password is sifive, but you probably won't need it if you are creating a container from this rootfs to chroot into later.

1.3 Decompress rootfs

Arch Linux

$ mkdir archriscv
$ sudo bsdtar -xf archriscv-20220727.tar.zst -C archriscv

Debian and Ubuntu

$ mkdir archriscv
$ sudo tar -I zstd -xf archriscv-20220727.tar.zst -C archriscv

tar may emit some warnings; they turn out to be harmless and can safely be ignored.

2 Start and Setup a Container

$ sudo systemd-nspawn -D ./archriscv -a -U

where -D provides the root directory for the container, -a (--as-pid2) runs the payload as PID 2 behind a minimal stub init, so that PID 1 duties such as reaping zombie children are still handled, and -U makes processes in the container use a UID range distinct from the one used outside the container.

2.1 Check Architecture

# uname -m
riscv64

2.2 System Upgrade

# pacman -Syu

2.3 (Optional) Install Packages

For example, if you want to install vim:

# pacman -S vim

2.4 (Optional) Set Default Editor

# echo 'export EDITOR=vim' >> ~/.bashrc && source ~/.bashrc

2.5 (Optional) Create a Regular User for Development

# useradd -m <username>

where -m means to create a home directory for the user to be added.

Use visudo if you'd like to grant sudo privilege for this user, and append this line under ## User privilege specification:

<username> ALL=(ALL) NOPASSWD: ALL

2.6 (Optional) Switch to a Regular User

# exec su <username>
$ cd ~
$ pwd
/home/<username>
$ echo 'export EDITOR=vim' >> ~/.bashrc && source ~/.bashrc

3 Compile and Run a Program

This section is about compiling & running a program manually. If you want to start your development from a git repo or similar, jump to 3.2 Compilation for instructions on installing git and the other packages your toolchain needs. If you are trying to create or modify an Arch Linux package (i.e. dealing with a PKGBUILD), refer to the related ArchWiki pages.

3.1 Coding

Run vim hello.c and type:

#include <stdio.h>

int main(int argc, char *argv[]) {
  printf("Hello RISC-V!\n");
  return 0;
}

Exit with Esc, then :wq followed by Enter, as this S/O answer suggests :-P

3.2 Compilation

First of all, update your local package repository data from the remote package sources (like sudo apt-get update):

$ sudo pacman -Sy

You should run this prior to any attempt to install a package, unless you are sure that your local copy of package info is up-to-date.

Then, install your development toolchain:

$ sudo pacman -S gcc cmake foo bar bah

Compile your code:

$ gcc -o hello hello.c

3.3 Checking for File Format

$ file hello
hello: ELF 64-bit LSB pie executable, UCB RISC-V, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-riscv64-lp64d.so.1, BuildID[sha1]=4a75c57e4e99654dca0d6dc91689dffbbe7dc581, for GNU/Linux 4.15.0, not stripped

3.4 Running the Program

$ ./hello
Hello RISC-V!

For native RISC-V board users:

Edit your /etc/pacman.conf with the following settings:

[core]
Server = https://archriscv.felixc.at/repo/$repo

[extra]
Server = https://archriscv.felixc.at/repo/$repo

[community]
Server = https://archriscv.felixc.at/repo/$repo

[unsupported]
Server = https://archriscv.felixc.at/repo/$repo

Finally, add the packager's public key to your keyring:

$ pacman-key --recv-keys "B5971F2C5C10A9A08C60030F786C63F330D7CB92"  # Always grab the latest key ID from https://archlinux.org/people/developers/#felixonmars


A Destructuring-Assignment Pitfall: The Mysteriously Vanishing Data

By JieJiSS
March 15, 2022, 14:00

In our PLCT ArchRV group, we built all the infrastructure ourselves: task assignment, marking, and status tracking are implemented by gh: XieJiSS/plct-archrv-pkg-bot, gh: cubercsl/archrv-pkg-notification-bot, and gh: Ast-x64/plct-archrv-status-worker. Recently, while maintaining marks and tracking package status, we found that plct-archrv-pkg-bot (hereafter "the bot") loses a few records whenever it restarts.

At first we suspected that some data had not been flushed to disk by restart time. Investigation made this look unlikely: in theory, every piece of data is written to disk on the spot when it is modified. To be safe, we added storePackageMarksSync, which uses synchronous IO, and forced a save at process exit via npm: async-exit-hook. Data was still going missing.

The logs confirm that everything in memory was written to disk on exit — so how was data being lost?

Next we considered a race condition, and used npm: lockfile to implement a process-level lock guaranteeing that only one bot process runs at a time. Apart from having no effect whatsoever, it worked great. So it wasn't a concurrent multi-process write problem either.

In the end, the investigation traced the root cause to export / import.

A minimal reproducible example:

// a.js
let A = [1, 2, 3, null];

function cleanUp() {
  A = A.filter(num => typeof num === "number");
}

function showData() {
  console.log(A);
}

module.exports = { A, cleanUp, showData };
// b.js
const { A, cleanUp, showData } = require("./a");

A.push(4);
cleanUp();
A.push(5);
A.push(6);
showData(); // only 1, 2, 3, 4

As you can see, all data pushed after cleanUp was executed is lost. This is because after A is reassigned inside a.js, the A obtained via destructuring in b.js and the A inside a.js no longer point to the same object. The destructured identifier A points to whatever memory the A property of module.exports pointed to at the moment of destructuring; the later A = A.filter(...) merely redirects a.js's internal A, and the externally destructured A is not updated along with it. We can translate the first pattern into C++:

const auto *A = a.A;
a.A = new vector<int>();
// at this point, A and a.A obviously no longer point to the same memory

To summarize: if you treat what JS destructuring assignment gives you as a &ref (a common mental model), you end up with exactly this kind of inconsistency.

To fix this, declare A as const in a.js and replace Array#filter with a hand-rolled in-place filter; see this commit for details. Alternatively, always access A as a.A from b.js.


python-mtrpacket fails its build-time tests on riscv64

By JieJiSS
January 31, 2022, 02:03

python-mtrpacket is an asynchronous network probing tool implemented in Python. Its current version (1.0.0-3) builds fine on Arch Linux mainline x86_64, but errors out on riscv64:

.E/bin/sh: line 1: mtr-packet-missing: command not found
..
======================================================================
ERROR: test_commands (__main__.TestCommands)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/build/python-mtrpacket/src/mtr-packet-python-1.0.0/test/test.py", line 210, in test_commands
asyncio_run(self.async_commands())
File "/build/python-mtrpacket/src/mtr-packet-python-1.0.0/test/test.py", line 218, in asyncio_run
return loop.run_until_complete(loop.create_task(coro))
File "/usr/lib/python3.10/asyncio/base_events.py", line 641, in run_until_complete
return future.result()
File "/build/python-mtrpacket/src/mtr-packet-python-1.0.0/test/test.py", line 203, in async_commands
async with mtrpacket.MtrPacket() as mtr:
File "/build/python-mtrpacket/src/mtr-packet-python-1.0.0/mtrpacket/__init__.py", line 138, in __aenter__
await self.open()
File "/build/python-mtrpacket/src/mtr-packet-python-1.0.0/mtrpacket/__init__.py", line 314, in open
if not await self.check_support('send-probe'):
File "/build/python-mtrpacket/src/mtr-packet-python-1.0.0/mtrpacket/__init__.py", line 368, in check_support
(_, args) = await self._command('check-support', check_args)
File "/build/python-mtrpacket/src/mtr-packet-python-1.0.0/mtrpacket/__init__.py", line 275, in _command
return await future
mtrpacket.ProcessError: failure to communicate with subprocess "./nc_mock.sh" (is it installed and in the PATH?)

----------------------------------------------------------------------
Ran 4 tests in 1.296s

FAILED (errors=1)

Here, /bin/sh: line 1: mtr-packet-missing: command not found is normal output: it belongs to the third test case and can be ignored. A quick look at the code shows this error is expected behavior and gets caught by self.assertRaises. The E (error) is the second test case: mtrpacket.ProcessError should not appear there, but it did.

A look at test/nc_mock.sh shows it is trivially simple:

#!/bin/sh

nc localhost 8901

So why does this blow up on riscv64?

Our first suspect was qemu-user mangling argv[0] so that nc_mock.sh couldn't be found, but that is easy to rule out. Just add the following to nc_mock.sh:

sleep 30

The test then hangs for thirty seconds at the second test case — so nc_mock.sh is in fact being executed. Conclusion: the original error message is misleading.

So if it runs, why does it error out? We modified the PKGBUILD (the script Arch Linux uses when building a package), changing the ./test.sh invocation in check() to ./test.sh || bash -i. That gives us a fast path into the packaging rootfs whenever test.sh (which indirectly invokes nc_mock.sh) returns non-zero.

By further instrumenting test/test.py and mtrpacket/__init__.py, we finally pinned down the problem:

First, the test entry point: test.py creates an mtr server, then starts the corresponding client. The second test case uses nc_mock.sh as the client:

mtr_packet_executable = os.environ.get('MTR_PACKET')  # ./nc_mock.sh
if not mtr_packet_executable:
    mtr_packet_executable = 'mtr-packet'

self._subprocess_name = mtr_packet_executable
self.process = await asyncio.create_subprocess_shell(
    mtr_packet_executable,
    stdin=asyncio.subprocess.PIPE,
    stdout=asyncio.subprocess.PIPE)

self._result_task = asyncio.ensure_future(self._dispatch_results())
self._opened = True

if not await self.check_support('send-probe'):
    await self.close()

Here we can see that a ./nc_mock.sh subprocess is created. Via nc, it connects to port 8901 on localhost, where the server pre-created by test.py is listening, so the normal test flow can proceed.

Next, the test flow itself: check_support('send-probe') is invoked right away. Following it in:

async def check_support(self, feature: str) -> bool:
    check_args = {'feature': feature}
    (_, args) = await self._command('check-support', check_args)

async def _command(self, command_type: str, arguments: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    # ...
    command_str = token + ' ' + command_type
    for argument_name in arguments:
        argument_value = arguments[argument_name]
        command_str += ' ' + argument_name + ' ' + argument_value
    command_str += '\n'

    self.process.stdin.write(command_str.encode('ascii'))

And here is where things go wrong. python-mtrpacket's checkdepends pulls in gnu-netcat, and this netcat behaves differently on riscv64 than on x86_64: it exits early. If you set LOCALE to it or sk, you can even see the error message in the corresponding language (Italian/Slovak). Why only those two languages? No idea — the Package Contents list ships only these:

usr/share/locale/it/
usr/share/locale/it/LC_MESSAGES/
usr/share/locale/it/LC_MESSAGES/netcat.mo
usr/share/locale/sk/
usr/share/locale/sk/LC_MESSAGES/
usr/share/locale/sk/LC_MESSAGES/netcat.mo

So... I don't really care about the exact error message anyway — I can't read Italian. I simply swapped gnu-netcat for openbsd-netcat in the PKGBUILD's checkdepends, since the simple usage in ./nc_mock.sh doesn't come anywhere near the dark corners where the two variants are incompatible.


A Detailed Analysis of Yet Another Attack on SS Stream Ciphers (from the Qihoo Team)

By JieJiSS
February 12, 2020, 16:13

0x00 TL;DR

Paper: here

SS/SSR's non-AEAD ciphers were already mostly dead; they hardly needed one more attack. Still, the idea behind this one is genuinely novel.

As far as this particular attack is concerned, the unsafe SS ciphers are mainly aes-*-cfb, aes-*-ofb, and aes-*-ctr; the safe ones are mainly chacha20, aes-*-gcm, and *-poly1305.

Remedy: switch the cipher to aes-*-gcm, or move to other tools.

0x01 The Attack in Detail

Before we begin:

I wrote this post only to dissect the concrete workings of a new replay attack; its content is for academic discussion only.

As for the loves and feuds among the SS and SSR development and maintenance teams, I have no wish to get involved; if you have already made up your mind about them, please don't ask me about any of it or for my personal opinion — this post will not touch on those matters.

The attack method is actually really simple.

...Just kidding.

Halfway through writing this, I came back to add: dear paper authors, could you please write the details out clearly... I'm begging you...

The target of ciphertext modification

Stream-encrypted data is split across multiple packets; concatenating those packets yields the complete ciphertext.

The ciphertext format is as follows:

[IV][encrypted payload]

The IV is used when decrypting the encrypted payload; we don't need to worry about exactly what it does.

Taking a plaintext HTTP request as the proxied content: according to the protocol SS uses, the raw (pre-encryption) data to be sent to the server is formatted as:

[target data][payload]

The payload part stores all of the HTTP request's data, while target data stores crucial information about the final destination server (the server hosting the website being visited). The structure of target data is:

[type][destination address][port]

type can be 0x01, 0x03, or 0x04 (see RFC 1928); type == 0x01 means the destination address field that follows is a 4-byte IPv4 address, i.e. the destination server's IP address. port is a 2-byte port number.

Background you need: due to the streaming nature of these AES modes, each plaintext block appears at the same position in the ciphertext after encryption — first in, first out. The ciphertext corresponding to target data is therefore guaranteed to appear in the first packet (the "initialization packet").

Each block is 16 bytes long, generally the same length as the IV (initialization vector); see this link.

Now, if we could modify the ciphertext so that the decrypted plaintext's destination address field holds the IPv4 address of a server under our control, wouldn't ss-server then forward the decrypted HTTP request to our server?

(Figures: normal-procedure.jpg, attacked-procedure.jpg)

Note: figure 2 somewhat simplifies how GFW nodes work; it is illustrative only and not fully accurate.

The figures above outline the attack flow. When the attacker sees a suspicious packet sent from the overseas server to the domestic user, it copies the packet into memory and, after a delay (tens of milliseconds to tens of seconds), sends a ciphertext-modified version of the packet back to ss-server, using a trick to make ss-server send the decrypted plaintext to a server controlled by the attacker. Of course, without the key, the attacker cannot obtain the ciphertext they want by encrypting plaintext. But thanks to the properties of aes-*-cfb and aes-*-ofb, XORing the very first block of the ciphertext modifies the first block of the plaintext correspondingly.

(Figure: aes-cfb-decryption)

aes-*-cfb 为例,上图的解密流程可以转化为:

Ciphertext and plaintext indices start from 1; both decryption and encryption use the \(\mathrm{Encrypt}\) function.

\(K\) is the key (\(Key\)) required for decryption.

\(P_i = \mathrm{Encrypt}(C_{i-1}, K)\,\,\,\mathtt{xor}\,\,C_i\), where \(C_0=\mathrm{IV}\).

Let's assume every packet ss-server sends to ss-local is an Encrypted HTTP Response (no TLS), i.e. the decrypted plaintext follows this format:

HTTP 1.1 404 Not Found              # HTTP version, status code, status text
Date: Thu, 13 Feb 2020 06:51:25 GMT # HTTP headers, one per line
Server: nginx
...

Response Body

But this data travels over the network encrypted. How do we obtain the plaintext without knowing \(K\)?

The answer: ss-server knows \(K\) — let ss-server decrypt it for us and send the plaintext to our server~

In other words, we want to modify the Encrypted HTTP Response so that the modified ciphertext decrypts to plaintext of the following form:

(Why we modify the Encrypted Response rather than the Encrypted Request will be explained at the appropriate point below.)

[type (1 byte)][fake address (4 bytes)][port (2 bytes)]1 404 Not Found
Date: Thu, 13 Feb 2020 06:51:25 GMT # the remaining plaintext is unmodified, treated wholly as [payload]
Server: nginx
...

Response Body

Why we replace bytes rather than prepend content will also be explained later.

Ta-da! When ss-server receives this packet and decrypts it, it finds that the first seven bytes match the [type][address][port] rule, concludes that the packet is a request sent by ss-local, and forwards the decrypted plaintext to fake_address:port. Just like that, whatever we have listening at fake_address:port receives the following decrypted plaintext:

1 404 Not⛬졷.ꖵ�                    # version (truncated), status code, status text (mojibake)
İ차﯐Thu, 13 Feb 2020 06:51:25 GMT # HTTP headers (mojibake at the start), one per line
Server: nginx
...

Response Body

Why the decrypted modified ciphertext contains mojibake will, again, be explained later.

By the way, the original paper doesn't say a single word about any of these three "why"s... I'm speechless... just end me already...

How exactly to modify the ciphertext

It's easy to see that aes-*-cfb encryption gives a byte-for-byte correspondence between ciphertext and plaintext: modifying the \(k\)-th byte of the ciphertext modifies the \(k\)-th byte of the plaintext.

Since we want to modify the first block of the decrypted plaintext, i.e. \(P_1\), we only need to consider the case \(i=1\):

\(P_1 = \mathrm{Encrypt}(C_0, K)\,\,\,\mathtt{xor}\,\,C_1\), which is equivalent to:

\(P_1 = \mathrm{Encrypt}(\mathrm{IV}, K)\,\,\,\mathtt{xor}\,\,C_1\)

Here \(\mathrm{IV}\) can be read directly from the very beginning of the packet (the part before \(C_1\)); \(K\) is unknown and unobtainable, but it is a fixed value, so \(\mathrm{Encrypt}(\mathrm{IV},K)\) is also a fixed value; call it \(E_0\). We can thus rewrite the formula as:

\(P_1=E_0\,\,\mathtt{xor}\,\,C_1\)

Now comes the fun part. From the laws of XOR: if \(x\,\,\mathtt{xor}\,\,y=z\), then \(x\,\,\mathtt{xor}\,\,(y\,\,\mathtt{xor}\,\,m)=z\,\,\mathtt{xor}\,\,m\); call this property ①. Hence:

\(P_1\,\,\mathtt{xor}\,\,m = \mathrm{Encrypt}(\mathrm{IV}, K)\,\,\,\mathtt{xor}\,\,(C_1\,\,\mathtt{xor}\,\,m)\), where \(m\) is what we now need to construct. The goal is to make \(P_1\,\,\mathtt{xor}\,\,m\) equal the IPv4 address of the server we control. Again by the properties of XOR, we can compute \(m\):

\(m = P_1\,\,\mathtt{xor}\,\,\mathtt{IPv4}\)

Given that \(P_1\) has the following format:

[type][fake address][port][other data]

Note: as computed earlier, [type][fake address][port] is 7 bytes in total, so it is guaranteed to land inside the first block.

So we really only care about the first 7 bytes of \(P_1\). The first seven bytes of the original \(P_1\) — per the HTTP protocol — are always the seven characters HTTP 1., so:

m = "HTTP 1.\x00\x00\x00\x00\x00\x00\x00\x00\x00" ^ [0x01][addr][port][9 bytes data]
# A block is 16 bytes long, and the XOR must always cover a whole block,
# hence the padding with 9 \x00 bytes.

The first 7 characters of an HTTP Request are not fixed, so \(m\) cannot be computed for it — that is why we must target the Encrypted HTTP Response.

Once \(m\) is computed, the modified ciphertext \(C_1'\) follows: \(C_1'=C_1\,\,\mathtt{xor}\,\,m\).

Didn't follow? Once more:

\(C_1\) decrypts to \(P_1\); by the aes-*-cfb property (property ① above), \(C_1\,\,\mathtt{xor}\,\,m\) decrypts to \(P_1\,\,\mathtt{xor}\,\,m\).

By careful construction, \(P_1\,\,\mathtt{xor}\,\,m\) equals [0x01][fake addr][port][other data], which is exactly the \(P_1'\) we want.

Therefore the ciphertext \(C_1'\) corresponding to \(P_1'\) is simply \(C_1\,\,\mathtt{xor}\,\,m\), the ciphertext corresponding to \(P_1\,\,\mathtt{xor}\,\,m\).

Finally, splice \(C_1'\) back into the ciphertext:

\(C=\mathrm{IV}+C_1'+C_2+C_3+\cdots\)

Send the resulting ciphertext straight to ss-server, and ss-server will decrypt the following content:

[0x01][fake address (4 bytes)][port (2 bytes)]1 404 Not⛬졷.ꖵ�
İ차﯐Thu, 13 Feb 2020 06:51:25 GMT # [payload]
Server: nginx
...

Response Body

Next, following the SOCKS5 protocol, ss-server forwards [payload] — everything from the eighth byte (1) onward — verbatim to fake_address:port. Keyless plaintext recovery accomplished.

Q & A

Q: Why modify the Encrypted HTTP Response rather than the Encrypted HTTP Request?

A: It's not that we don't want to — we can't. The first seven plaintext bytes of an Encrypted HTTP Response are fixed and known, so \(m\) can be computed from them; the first seven plaintext bytes of an Encrypted HTTP Request have too many possibilities (GET url..., POST url..., etc.), making enumeration too expensive.

Q: When modifying the ciphertext to alter the corresponding plaintext, why replace bytes instead of prepending content?

A: Again, we can't. AES operates on blocks; prepending characters would shift everything out of alignment, and decryption would produce nothing but garbage.

Q: Why does the decrypted modified ciphertext contain mojibake?

A: It doesn't always. With aes-*-ofb or aes-*-ctr there would be no mojibake. Look at the aes-*-cfb decryption diagram again: \(C_{i-1}\) participates in decrypting \(C_i\). We modified \(C_1\), so decrypting \(C_2\) yields garbage; but we did not modify \(C_2\) or any later blocks, so decrypting \(C_3\) and beyond produces no garbage. (Further reading: error propagation.)

aes-*-ofb does not need \(C_{i-1}\) or \(P_{i-1}\) to decrypt \(C_i\) (see this link), hence no mojibake there.

Q: If the payload wasn't plaintext to begin with (e.g. HTTPS), am I affected?

A: The attacker can still tell you are using SS, but your data security and privacy are not threatened: the attacker can only strip off the outermost layer of encryption.

0x02 Closing Words

I can't help but reflect that the era of SS and SSR ruling the world is over. Neither SS's aes-*-*fb, nor the long-deprecated OTA, nor the unmaintained SSR is a safe choice anymore. Even the theoretically secure aes-*-gcm family of AEAD ciphers cannot escape large-scale blocking during sensitive periods (unless combined with *-obfs). The future belongs to camouflage-based tools: whether disguised as HTTP/HTTPS (TLS) or as WebSocket, they already have fairly mature software and fairly active communities. This newly proposed attack is merely another echo of a bygone era, nothing more.


Configuring dnscrypt-proxy on macOS (without split-routing)

By JieJiSS
February 4, 2020, 16:34

Introduction

ISPs in mainland China frequently hijack certain domains and redirect them to IPs of their own choosing, in order to forcibly inject ads. dnscrypt-proxy supports the DNSCrypt protocol, DoH, and more, and can protect you from DNS pollution.

However, after installing dnscrypt-proxy per the official tutorial, many people find that domestic websites load slowly or even fail to load. This is because dnscrypt-proxy will often resolve a domain to the IP of its overseas host, greatly lengthening the path packets travel — possibly across submarine cables — with long delays and a marked drop in speed. The configuration used in this tutorial largely solves that problem, and suggests further viable approaches.

Step 0: Prerequisites

  • You should have a working SOCKS5 proxy on your machine. Below, we assume it listens on local port 8886.
    • If you also have an HTTP/HTTPS proxy, even better. Below, we assume it listens on local port 8887.
    • If you only have an HTTP/HTTPS proxy, you will need to prepare a separate SOCKS5 proxy.
  • Local port 53 on your machine is not occupied.
  • curl or wget is installed. This tutorial uses curl.
  • The nslookup command is available.
  • You should have an account with administrator (root) privileges and know its password.
  • You should be able to access GitHub smoothly.

Step 1: Install dnscrypt-proxy

1.1 Obtain root privileges

Open Terminal.app (or any shell you are familiar with), then type and run:

sudo -s

You may need to enter your computer's password. Nothing is echoed as you type — this keeps your password safe from prying eyes behind you.

Once that completes, type and run:

whoami

If it prints root, you have successfully obtained root privileges.

1.2 Download the files

Follow this link: gh: DNSCrypt/dnscrypt-proxy/releases, and download the latest binary release of dnscrypt-proxy. The file is roughly 1-10 MB and named like:

dnscrypt-proxy-macos-<version>.tar.gz

After downloading, extract it into any directory that's easy to remember; below we assume ~/dnscrypt-proxy.

The extracted files should include dnscrypt-proxy and example-dnscrypt-proxy.toml.

Step 2: Configure dnscrypt-proxy

2.1 Initialization

First, change into the directory containing dnscrypt-proxy:

cd ~/dnscrypt-proxy

Then run the following commands:

mv ./example-dnscrypt-proxy.toml ./dnscrypt-proxy.toml
./dnscrypt-proxy

Since this is the first run, dnscrypt-proxy initializes itself automatically, which takes a while. If no new text appears for some time and no error shows up, initialization succeeded.

  • If initialization fails with permission denied, try sudo ./dnscrypt-proxy.
  • If initialization fails with address already in use, check whether local port 53 is occupied (sudo lsof -i:53) and quit every process the command lists (regardless of whether the NODE column says TCP or UDP — unless you are certain you know exactly what you are doing).
  • If initialization fails with Syntax error, you downloaded the wrong file; download it again.

2.2 Testing that dnscrypt-proxy works

Open a new terminal window.

First, in that window run:

cd ~/dnscrypt-proxy
./dnscrypt-proxy -resolve www.baidu.com

You should see IP addresses: xxx.xxx.xxx.xxx. If you instead see IP addresses: -, something went wrong in the earlier setup, or your machine has lost network connectivity.
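If you want to script this check, here is a minimal sketch (the helper function name and sample output strings are made up for illustration; only the IP addresses: prefix comes from the real output described above):

```shell
# Classify a line of `./dnscrypt-proxy -resolve` output.
# "IP addresses: -" means resolution failed; any other
# "IP addresses: ..." line means it succeeded.
check_resolve_output() {
  case "$1" in
    *"IP addresses: -"*) echo "resolution failed" ;;
    *"IP addresses: "*)  echo "resolution ok" ;;
    *)                   echo "unexpected output" ;;
  esac
}

check_resolve_output 'IP addresses: 110.242.68.4'   # -> resolution ok
check_resolve_output 'IP addresses: -'              # -> resolution failed
```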

Next, run:

nslookup www.baidu.com 127.0.0.1

You should see roughly ten lines of output.

Next, open System Preferences and type DNS into the search box at the top right. Click DNS servers, then find the + button at the bottom left of the white pane on the left. Click +, type 127.0.0.1 into the new input field, and press Return; drag this entry above all the others so that it becomes the first record. Click OK at the bottom right, and once that window closes, click Apply at the bottom right of the settings window underneath it.

Open your usual browser in private (or extension-free) mode and try to visit ip.sb. You may need to turn off any global proxy first. The page should now load normally.

Finally, go back to the first terminal window and press Control-C to stop ./dnscrypt-proxy. Then run:

./dnscrypt-proxy -service install
./dnscrypt-proxy -service start

Both commands require root privileges. From now on you can start and stop the dnscrypt-proxy service with sudo ./dnscrypt-proxy -service start and sudo ./dnscrypt-proxy -service stop.

2.3 Making dnscrypt-proxy resolve domestic domains correctly

You may already have noticed that dnscrypt-proxy often resolves domains to overseas IPs, like this:

$ nslookup music.163.com 127.0.0.1
Server:		127.0.0.1
Address:	127.0.0.1#53

Non-authoritative answer:
music.163.com	canonical name = music.ntes53.netease.com.
music.ntes53.netease.com	canonical name = overseasv4.music.ntes53.netease.com.
Name:	overseasv4.music.ntes53.netease.com
Address: 103.126.92.133
Name:	overseasv4.music.ntes53.netease.com
Address: 103.126.92.132

$ whois 103.126.92.0/24
% Information related to '103.126.92.0/22AS137263'

route:          103.126.92.0/22
origin:         AS137263
descr:          HONGKONG NETEASE INTERACTIVE ENTERTAINMENT LIMITED
                Unit 802, 8th Floor, Chuang's Tower,
                30-32 Connaught Road Central
mnt-by:         MAINT-NETEASEIE-HK
last-modified:  2018-12-03T07:34:50Z
source:         APNIC

In this example, NetEase Cloud Music's domain is resolved to Hong Kong, which is clearly not optimal.

The simplest fix is to edit ~/dnscrypt-proxy/dnscrypt-proxy.toml in a code editor:

  • Around line 96, find:
# proxy = 'socks5://127.0.0.1:9050'

and change it to (using port 8886 as an example):

proxy = 'socks5://127.0.0.1:8886'
  • (Optional) Around line 103, find:
# http_proxy = 'http://127.0.0.1:8888'

and change it to (using port 8887 as an example):

http_proxy = 'http://127.0.0.1:8887'

The most important step:

Visit dnscrypt.info/map and look for the dots located in mainland China; click a dot to see its name. At the time of writing (February 4, 2020) there were three dots (three available servers) in mainland China: geekdns-south, geekdns-doh-east, and geekdns-hk.

~/dnscrypt-proxy/dnscrypt-proxy.toml 文件的第 30 行附近,找到:

# server_names = ['scaleway-fr', 'google', 'yandex', 'cloudflare']

and change it to (using geekdns-south, geekdns-doh-east, and geekdns-hk as examples):

server_names = ['geekdns-south', 'geekdns-doh-east', 'geekdns-hk']

Note that the quotes must be plain ASCII single quotes.
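Putting the three edits together, the relevant lines of dnscrypt-proxy.toml should end up looking roughly like this (the ports 8886/8887 and the geekdns server names are just the examples used above; adjust them to your own setup):

```toml
# Resolvers located in mainland China, picked from dnscrypt.info/map
server_names = ['geekdns-south', 'geekdns-doh-east', 'geekdns-hk']

# Send dnscrypt-proxy's own upstream queries through the local SOCKS5 proxy
proxy = 'socks5://127.0.0.1:8886'

# Optional: HTTP/HTTPS proxy, used for DoH servers
http_proxy = 'http://127.0.0.1:8887'
```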

Save the file and close the editor.

Restart the dnscrypt-proxy service from the command line:

./dnscrypt-proxy -service stop
./dnscrypt-proxy -service start

Resolve the domain again, and you will find that the results are now as expected:

$ nslookup music.163.com 127.0.0.1
Server:		127.0.0.1
Address:	127.0.0.1#53

Non-authoritative answer:
music.163.com	canonical name = music.ntes53.netease.com.
music.ntes53.netease.com	canonical name = bgpv6.music.ntes53.netease.com.
Name:	bgpv6.music.ntes53.netease.com
Address: 59.111.181.35
Name:	bgpv6.music.ntes53.netease.com
Address: 59.111.181.60
Name:	bgpv6.music.ntes53.netease.com
Address: 59.111.181.38

$ whois 59.111.181.0/24
person:         Zongyi Xu
address:        NetEase Building No.16 Ke Yun Road,
address:        Tianhe Avenue,Guangzhou,Guangdong,China.
country:        CN
phone:          +86020-85106115
e-mail:         jiangxu@corp.netease.com
nic-hdl:        ZX3316-AP
mnt-by:         MAINT-CNNIC-AP
last-modified:  2016-03-07T06:05:41Z
source:         APNIC

This matches the result returned by Alibaba's public DNS service (223.5.5.5).

2.4 Other approaches

2.4.1 贵富网list (i.e. gfwlist) + black/white lists

See: blog post no. 511 on 酥酥乳.tools

After converting the list to dnsmasq_贵富网list.conf, use a regex replacement: replace /^server=\/([\S\s]+)\/127.0.0.1#5353 with $1 127.0.0.1:53, paste the result under the Example map entries section around line 301 of the dnscrypt-proxy config file, and pair it with a self-hosted local DNS server speaking DNSCrypt or DoH.
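The regex replacement can also be done on the command line. A minimal sketch using sed (the entry example.com is a made-up sample; the real input is the converted dnsmasq conf mentioned above):

```shell
# One entry in dnsmasq "server=" syntax:
line='server=/example.com/127.0.0.1#5353'

# Rewrite it to the "domain ip:port" form expected by dnscrypt-proxy's
# forwarding rules, mirroring the regex replacement described above:
printf '%s\n' "$line" \
  | sed -E 's|^server=/(.+)/127\.0\.0\.1#5353$|\1 127.0.0.1:53|'
# -> example.com 127.0.0.1:53
```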

2.4.2 Doing it on the router

dnsmasq + LEDE and the like; mature tutorials already exist, so I will not repeat them here.

2.4.3 Using a traffic-splitting script

The traditional approach: it needs regular maintenance, but works well.

gh: CNMan/dnscrypt-proxy-config


A Simple Way to Understand the Binomial Theorem and Its Generalization

By JieJiSS
January 11, 2020, 08:46

0x00 The Binomial Theorem

First things first: \(\mathrm{C}_n^k=\binom{n}{k}=\frac{n!}{(n-k)! \cdot k!}\)

0x00.1 Yang Hui's (Pascal's) Triangle

Previously on this blog: a special way of printing Yang Hui's triangle

      1               (a+b)**0 = 1
     / \
    1   1             (a+b)**1 = a+b
   / \ / \
  1   2   1           (a+b)**2 = a² + 2ab + b²
 / \ / \ / \
1   3   3   1         (a+b)**3 = a³ + 3a²b + 3ab² + b³
    ...   ...

It is not hard to see that the numbers in each row of Yang Hui's triangle are exactly the coefficients of the corresponding expansion of \((a+b)^n\), and that each coefficient equals a binomial coefficient \(\binom{n}{k}\).

One simple way to understand this: each number in Yang Hui's triangle counts the paths that start at the apex, move only down-left or down-right, and end at that number's position. A little reasoning shows that to reach the \(k\)-th number from the left in row \(n\), one must move down-left exactly \(n-k\) times and down-right exactly \(k-1\) times. Imagine Xiao Ming walking from the apex under these rules: each down-left or down-right move takes him down one row, so after descending \(n-1\) rows he reaches the row containing the target number. In other words, his journey consists of exactly \(n-1\) moves, of which exactly \(n-k\) are down-left. Conclusion: this is at heart a problem of counting combinations, and the number of possible paths is \({\rm C}_{n-1}^{n-k}={\rm C}_{n-1}^{k-1}=\binom{n-1}{k-1}\).
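As a quick sanity check of this path count, take the second number from the left in row \(n=4\) (the row 1 3 3 1), so \(k=2\): reaching it takes \(n-1=3\) moves, of which \(n-k=2\) go down-left, giving

```latex
\binom{3}{2}=\binom{3}{1}=3,
```

which matches the entry \(3\) in the triangle.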

Correspondingly, expanding the polynomial is the same kind of counting problem. In the term \(p\cdot a^qb^{n-q}\) of the expansion, \(p\) is the number of copies of \(a^qb^{n-q}\); each copy arises by picking \(q\) of the factors in \((a+b)(a+b)\cdots(a+b)\) and multiplying the \(a\) from those \(q\) factors with the \(b\) from the remaining \(n-q\) factors. The number of ways to pick \(q\) factors out of \(n\) is \(\binom{n}{q}\).

From this it follows that the numbers in row \(n+1\) of Yang Hui's triangle correspond one-to-one to the binomial coefficients in the expansion of \((a+b)^n\).

A small extension:

\(f(x,y)=(x+y)^n\),求 \(f(x,y)\) 的展开式二项式系数之和。

In fact, the \((i+1)\)-th term of the expansion can be written directly as \(\binom{n}{i}\cdot x^{n-i}y^i\), where \(0 \le i \le n\). So to compute the sum of the binomial coefficients \(\sum_{i=0}^{n}\binom{n}{i}\), we can simply substitute \(x=1,\,y=1\) into \(\sum_{i=0}^{n}\binom{n}{i}\cdot x^{n-i}y^i\). In other words, the answer is \(f(1,1)=2^n\).
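A concrete check at \(n=3\): expand and then substitute \(x=y=1\),

```latex
(x+y)^3 = x^3 + 3x^2y + 3xy^2 + y^3
\quad\Longrightarrow\quad
f(1,1) = 1+3+3+1 = 8 = 2^3.
```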

Notice that \(2^n\) is also the total number of \(n\)-bit binary numbers. Is that merely a coincidence?

0x01 The Generalized Formula

Atop the towering black crags, the demon's profane whisper rises: \(\sum_{i=0}^n m^i\cdot\binom{n}{i}=(m+1)^n\)

A lemma: \(\binom{n}{n-k}=\binom{n}{k}\), which is easy to see. \(\binom{n}{n-k}\) counts the ways of picking \(n-k\) items out of \(n\), leaving \(k\) behind; \(\binom{n}{k}\) counts the ways of picking \(k\), leaving \(n-k\) behind. But every chosen set corresponds to exactly one left-behind set, so "leaving \(k\) behind" can just as well be read as "picking \(k\) out", merely picking them in order to leave them. The most direct proof, of course, is to apply the definition: expand the factorials, cancel the identical terms, and rewrite the result back as a binomial coefficient.

Yes, there is a generalization of the binomial theorem: \(\sum_{i=0}^n m^i\cdot\binom{n}{i}=(m+1)^n\), where \(m,n \in \mathbb{Z^+}\). What we want to discuss now is how to understand this formula simply.

Let us first look at the case \(m=1\), which is exactly the case given at the end of the previous section: \(\sum_{i=0}^{n}\binom{n}{i}=2^n\). To explain it simply, we return once more to Yang Hui's triangle: after \(n\) moves, Xiao Ming must arrive at one of the numbers in row \(n+1\). To make \(n\) moves he makes \(n\) choices, each time picking either down-left or down-right, so the total number of possibilities is \(2^n\), which is the right-hand side of the formula. The left-hand side is the sum, over every number in row \(n+1\), of the number of ways Xiao Ming can reach it, so it too equals \(2^n\).

Abstracting this line of thought slightly, we can think in terms of binary numbers: take an \(n\)-bit binary number whose \(k\)-th bit is \(1\) if Xiao Ming's \(k\)-th choice was down-left and \(0\) if it was down-right. Abstracting one step further and removing Xiao Ming entirely: list all \(n\)-bit binary numbers; among them, the count of numbers with exactly \(k\) ones is \(\binom{n}{k}\) (equivalent to choosing \(k\) of the \(n\) bits to be \(1\) and setting the rest to \(0\)). Mathematically, this viewpoint makes it easier to show that the identity holds for \(m=1\).

What if \(m > 1\)? How do we generalize the argument? It is simple: use the corresponding idea with base-\((m+1)\) numbers.

For example, let \(m=2\): list all \(n\)-digit ternary numbers, \(3^n\) of them in total. Among these, the number of ways to choose which \(k\) digits are nonzero is \(\binom{n}{k}\) (choosing \(k\) of the \(n\) digits to be \(1\) or \(2\), with the rest \(0\)). Note that each nonzero digit has two choices, \(1\) or \(2\); so if an \(n\)-digit ternary number contains \(n-k\) zeros, the remaining \(k\) digits admit \(2^k\) possibilities, while "containing \(n-k\) zeros" itself can happen in \(\binom{n}{n-k}=\binom{n}{k}\) ways. Multiplying the two gives exactly \(\binom{n}{k}\cdot 2^k\). Summing \(\binom{n}{k}\cdot 2^k\) over all \(k\) must therefore equal the total number of \(n\)-digit ternary numbers, \(3^n\).
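A small numerical check at \(m=2,\ n=2\), i.e. two-digit ternary numbers:

```latex
\sum_{i=0}^{2} 2^i\binom{2}{i}
= 2^0\binom{2}{0} + 2^1\binom{2}{1} + 2^2\binom{2}{2}
= 1 + 4 + 4 = 9 = 3^2.
```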

There is no need to push further: the idea is the same for larger \(m\).

