A quick intro to Zeno: Zeno is the Internet Archive’s self-described state-of-the-art WARC web archiver, and as far as I know, the only Golang WARC archiver to date.
Starting with the least meaningful metric: from June 2 (GSoC Coding officially begins) to August 31 (near the end), roughly 90 days, I opened 25 PRs to Zeno: 23 merged, 2 open. I also sent PRs to gowarc (Zeno’s WARC read/write/recording library) and gocrawlhq (Zeno’s tracker client), plus a few PRs to external dependencies.
Here are some of the more interesting bits along the way.
CSS, the myth
Regex master
As we all know, CSS can reference external URLs. For example, adding a background image via CSS:
body {
  background: url(http://example.com/background.jpg);
}
Zeno parses inline CSS inside HTML and tries to extract values from URL tokens and string tokens using regex. The two simple regexes looked like this:
The biggest issue was that the second one, backgroundImageRegex, was far too permissive — it matched anything inside parentheses.
generated by regexper.com
This meant Zeno often parsed a lot of nonexistent relative paths from inline CSS. For example, in color: rgb(12, 34, 56), the 12, 34, 56 inside the parentheses got matched, causing Zeno to crawl a bunch of annoying 404 asset URLs.
How do we fix this? Write an even cleverer regex? I’d rather not become a regex grandmaster. Let’s use a real CSS parser.
A proper CSS parser should correctly handle token escaping and other housekeeping.
What in CSS actually makes network requests?
Before picking a CSS parser, I first asked: “For an archival crawler, which CSS constructs can contain useful external resources?”
So the only CSS tokens that can initiate network requests (without JS) are url() and the string token following @import.
There are two flavors of url():
The older unquoted form: url(http://example.com) – a URL token
The quoted form (single or double quotes): url("http://example.com") – a function token plus a string token
Their parsing/escaping differs. In this report, “URL token” refers to both.
CSS parser
With the spec in mind, I surveyed Go CSS parser libraries. The only one that seemed somewhat viable was https://github.com/tdewolff/parse — widely used, looks decent. But hands-on testing was a wake-up call.
It doesn’t decode token values; it’s more of a lexer/tokenizer. Not quite enough.
For instance, it can only give you the whole token like url( "http://a\"b.c" ) — you get out what you put in — but it can’t decode the value to http://a"b.c.
Other Golang CSS parsers were even less helpful.
So I wrote a small parser dedicated to extracting URL-token values, focusing on escapes, newlines, and whitespace. Then I wired it up with tdewolff/parse.
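To give a feel for the unescaping involved, here is a minimal sketch (not Zeno's actual implementation): CSS allows `\` followed by 1-6 hex digits plus an optional whitespace terminator, or `\` before any other code point to mean that code point literally.

```go
package main

import (
	"fmt"
	"strings"
)

// decodeCSSEscapes resolves backslash escapes in a CSS string/URL token
// value, handling the two forms from the CSS Syntax spec: \ + 1-6 hex
// digits (terminated by optional whitespace) and \ + any other code point.
func decodeCSSEscapes(s string) string {
	var b strings.Builder
	rs := []rune(s)
	for i := 0; i < len(rs); i++ {
		if rs[i] != '\\' || i == len(rs)-1 {
			b.WriteRune(rs[i])
			continue
		}
		i++
		if isHex(rs[i]) {
			// Consume up to 6 hex digits.
			v, n := 0, 0
			for i < len(rs) && isHex(rs[i]) && n < 6 {
				v = v*16 + hexVal(rs[i])
				i++
				n++
			}
			// A single whitespace after the escape is part of it.
			if !(i < len(rs) && (rs[i] == ' ' || rs[i] == '\t' || rs[i] == '\n')) {
				i-- // no terminator: step back so the loop revisits this rune
			}
			b.WriteRune(rune(v))
		} else {
			b.WriteRune(rs[i]) // \" -> ", \) -> ), etc.
		}
	}
	return b.String()
}

func isHex(r rune) bool {
	return r >= '0' && r <= '9' || r >= 'a' && r <= 'f' || r >= 'A' && r <= 'F'
}

func hexVal(r rune) int {
	switch {
	case r >= '0' && r <= '9':
		return int(r - '0')
	case r >= 'a' && r <= 'f':
		return int(r-'a') + 10
	default:
		return int(r-'A') + 10
	}
}

func main() {
	fmt.Println(decodeCSSEscapes(`http://a\"b.c`)) // http://a"b.c
	fmt.Println(decodeCSSEscapes(`\68 ello`))      // hello
}
```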
In Zeno, each URL crawl task is called an item. Every item has its own type and state.
type Item struct {
	id  string // ID is the unique identifier of the item
	url struct {
		*url.URL
		mimetype  *mimetype.MIME
		Hops      int // the number of hops this item is the result of; a hop is a "jump" from one page to another page
		Redirects int
	}
	status   ItemState // Status is the state of the item in the pipeline
	children []*Item   // Children is a slice of Item created from this item
	parent   *Item     // Parent is the parent of the item (will be nil if the item is a seed)
}
Briefly, Hops is the page depth, and Redirects is the number of redirects followed.
Items form a tree. A root item with no parent is a seed item (a page/outlink), while all descendants are asset items (resources).
For example, if we archive archive.org, that item is the seed. On the page we discover archive.org/a.jpg as an asset; we add it to item.children. We also find archive.org/about as an outlink, so we create a separate seed item; that new seed’s hops += 1.
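That discovery flow can be sketched with a simplified stand-in for the real struct (field and method names here are illustrative, not Zeno's actual API):

```go
package main

import "fmt"

// Item is a simplified stand-in for Zeno's item (illustrative only).
type Item struct {
	url      string
	hops     int
	parent   *Item
	children []*Item
}

// addAsset records a resource discovered on this page as a child item.
func (i *Item) addAsset(u string) *Item {
	c := &Item{url: u, hops: i.hops, parent: i} // assets inherit the page's hop count
	i.children = append(i.children, c)
	return c
}

// newOutlinkSeed creates a fresh seed item for an outlink, one hop deeper.
func (i *Item) newOutlinkSeed(u string) *Item {
	return &Item{url: u, hops: i.hops + 1} // no parent: it is a new seed
}

func main() {
	seed := &Item{url: "https://archive.org"}
	asset := seed.addAsset("https://archive.org/a.jpg")
	outlink := seed.newOutlinkSeed("https://archive.org/about")
	fmt.Println(asset.parent == seed, outlink.hops, outlink.parent == nil)
}
```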
We know HTML has inline CSS, but it also references standalone CSS via <link rel="stylesheet" href="a.css">.
Previously, Zeno struggled on pages with separate CSS files (i.e., most pages), not only due to the lack of a proper CSS parser, but also because resources referenced inside CSS are assets of an asset — the seed item (HTML) → asset item (CSS file) → asset (e.g., images). Zeno generally doesn’t extract assets-of-assets.
My change was to integrate the parser above and then open a “backdoor” for CSS that is an asset of HTML: when item.mimetype == CSS && item.parent.mimetype == HTML, allow that asset item to extract its own assets. (We also need a backdoor for HTML → CSS → CSS via @import; see below.)
To handle CSS nesting syntax as well as possible, I donned the regex robe again and added a smarter regex fallback parser that kicks in when tdewolff/parse fails.
The @import rule allows users to import style rules from other style sheets. If an @import rule refers to a valid stylesheet, user agents must treat the contents of the stylesheet as if they were written in place of the @import rule
“In place” is interesting. It made me wonder: if a page recursively @imports forever, what do browsers do? The spec doesn’t mention a depth limit. Do real browsers cap @import depth like they cap redirect chains?
I tested it: browsers don’t enforce a recursion limit for @import.
So you can even make a page that never finishes loading, lol.
To prevent a CSS @import DoS on Zeno (unlikely, but let’s be safe), I added a --max-css-jump option to limit recursion depth.
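Conceptually the guard is just a depth counter threaded through extraction. A sketch (the flag name --max-css-jump is real; everything else here, including the regex and the self-importing stylesheet, is illustrative):

```go
package main

import (
	"fmt"
	"regexp"
)

var maxCSSJump = 3 // would come from the --max-css-jump flag

var importRe = regexp.MustCompile(`@import\s+(?:url\()?["']?([^"')\s;]+)`)

// fetchCSS stands in for an HTTP fetch; here it simulates a stylesheet
// that endlessly @imports itself.
func fetchCSS(u string) string { return `@import url("` + u + `");` }

// extractImports follows @import chains but refuses to recurse past
// maxCSSJump, so a self-importing stylesheet cannot pin a worker forever.
func extractImports(u string, depth int, visited []string) []string {
	if depth > maxCSSJump {
		return visited
	}
	for _, m := range importRe.FindAllStringSubmatch(fetchCSS(u), -1) {
		visited = append(visited, m[1])
		visited = extractImports(m[1], depth+1, visited)
	}
	return visited
}

func main() {
	// Bounded instead of infinite, despite the import loop.
	fmt.Println(len(extractImports("loop.css", 1, nil)))
}
```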
Also, the CSS spec says: when resolving ref URLs inside a separate CSS file, the base URL should be the CSS file’s URL, not the HTML document’s base. Thanks to Zeno’s item tree design, this came for free — no special handling needed in the PR.
Firefox is wrong
This section isn’t about Zeno — it’s just a quirky inconsistency I noticed while reading the CSS spec. Fun trivia.
First, look at the escape handling for string tokens.
U+005C REVERSE SOLIDUS (\)
1. If the next input code point is EOF, do nothing.
2. Otherwise, if the next input code point is a newline, consume it.
3. Otherwise, (the stream starts with a valid escape) consume an escaped code point and append the returned code point to the <string-token>’s value.
If a backslash in a string token is followed by EOF, do nothing (i.e., ignore the escape and return the token as-is).
But if a backslash is followed by a valid escape, proceed with escape handling.
If the first code point is not U+005C REVERSE SOLIDUS (\), return false.
Otherwise, if the second code point is a newline, return false.
Otherwise, return true.
So for string tokens, as long as the next code point is not EOF and not a newline, it’s considered a valid escape.
This section describes how to consume an escaped code point. It assumes that the U+005C REVERSE SOLIDUS (\) has already been consumed and that the next input code point has already been verified to be part of a valid escape. It will return a code point.
Consume the next input code point.
1. hex digit
Consume as many hex digits as possible, but no more than 5. Note that this means 1-6 hex digits have been consumed in total. If the next input code point is whitespace, consume it as well. Interpret the hex digits as a hexadecimal number. If this number is zero, or is for a surrogate, or is greater than the maximum allowed code point, return U+FFFD REPLACEMENT CHARACTER (�). Otherwise, return the code point with that value.
2. EOF
This is a parse error. Return U+FFFD REPLACEMENT CHARACTER (�).
3. anything else
Return the current input code point.
If no hex digits are consumed, it returns the current code point unchanged. If it encounters EOF, it returns U+FFFD.
Here’s the issue: in the branch with zero hex digits, this algorithm can’t encounter EOF — because the prior “valid escape” check needs a next code point, which excludes EOF.
So which rule wins in the overall tokenization? Do we follow the higher-level “Consume a token” rule (backslash + EOF in a string does nothing), or the escape rule (backslash + EOF returns U+FFFD)?
As you’ve gathered above, usable CSS lexers/parsers for Go are scarce, so @renbaoshuo’s CSS lexer port looked promising — and it is. I made a small performance optimization, replaced tdewolff/parse and the regex fallback with it, and it’s been working great. CSS Nesting and CSS custom properties (variables) are handled fine.
Two years ago, in Zeno#55, we tried to implement headless browsing using Rod.
Rod is a high-level driver for the DevTools Protocol. It’s widely used for web automation and scraping. Rod can automate most things in the browser that can be done manually.
But there was concern that the Chrome DevTools Protocol (CDP) might manipulate network data (e.g., tweak HTTP headers, transparently decompress payloads), so #55 was put on hold.
After skimming Rod’s request hijacking code, I saw it operates outside CDP. I confirmed with the Rod maintainer that an external http.Client (our gowarc client) can have full control over Chromium’s network requests.
Hijacking Requests | Respond with ctx.LoadResponse():
* As documented here: <https://go-rod.github.io/#/network/README?id=hijack-requests>
* The http.Client passed to ctx.LoadResponse() operates outside of the CDP.
* This means the http.Client has complete control over the network req/resp, allowing access to the original, unprocessed data.
* The flow is like this: browser --request-> rod ---> server ---> rod --response-> browser
Using CDP as a MITM beats general HTTP/socket-based mitmproxy approaches in one dimension: you can control requests per tab with finer granularity.
So, let’s build it.
I thought it would take a week. It took over two months, mostly ironing out details.
Lazy loading has been common since the pre-modern web, and modern sites increasingly load content dynamically.
So once a page “finishes” loading, the first thing is to scroll to trigger additional resource loads. Scrolling well is an art:
Each scroll step shouldn’t exceed the tab’s viewport height, or you’ll skip elements.
Scroll too fast and some fixed-rate animations won’t fully display, losing chances to load resources.
Scroll too slow and you waste time; headless is heavy — keeping tabs alive is costly.
It should be silky smooth.
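The first constraint can be sketched as a simple scroll schedule. The numbers and the slight-overlap heuristic below are illustrative, not Zeno's actual tuning:

```go
package main

import "fmt"

// scrollPlan returns the scroll offsets for a page, stepping at most one
// viewport height at a time so no element is skipped, with a small
// overlap between steps to be safe.
func scrollPlan(pageHeight, viewportHeight int) []int {
	step := viewportHeight * 9 / 10 // 10% overlap between steps
	var offsets []int
	for y := step; y < pageHeight; y += step {
		offsets = append(offsets, y)
	}
	return append(offsets, pageHeight) // finish at the bottom
}

func main() {
	fmt.Println(scrollPlan(5000, 1080))
}
```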
Rather than reinventing this, I recalled webrecorder’s archiver auto-scrolls. Indeed, webrecorder/browsertrix-behaviors provides a simple heuristic scroller and many other useful behaviors (auto-play media, auto-click, etc.), all bundled as a single JS payload. Perfect — drop it in.
DOM
Traditional crawlers work with raw HTML per request.
In a browser, everything is the DOM. A tab’s DOM tree results from the server’s HTML response, the browser’s HTML normalization, JS DOM manipulation, and even your extensions.
Open a direct .mp4 URL and the browser actually synthesizes an HTML document with a video element to play it.
To keep Zeno’s existing outlink post-processing compatible (so we can extract outlinks from headless runs), I export the tab’s DOM as HTML and stash it in item.body, letting our current outlink extractors run with minimal changes.
nf_conntrack bites back
After finishing headless outlink crawling, I noticed that under high concurrency, new connections started failing with mysterious connection EOFs, spiraling into endless retry loops. It looked like a race, and I chased it on and off for a couple of weeks. dmesg was clean; fs.file-max was fine.
By chance I discovered my upstream device’s occasional outages aligned with my headless tests…
Thanks to @Ovler for setting up monitoring.
gowarc doesn’t implement HTTP/1.1 keep-alive (and no HTTP/2 yet).
Zeno opens a new connection for each request.
Conntrack keeps entries around for a while after connections close.
The egress router only has 512MB RAM; Linux auto-set nf_conntrack_max to 4096.
Headless Zeno generates a lot of requests.
When the conntrack table fills up, new connections are dropped; existing ones keep working. My chat apps stayed online, so I didn’t notice.
Raising nf_conntrack_max on the router fixed it.
Future work:
I worry Zeno might also exhaust target servers’ TCP connection pools under high concurrency. The real fix is connection reuse:
Support HTTP/1.1 keep-alive.
Support HTTP/2.
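For context on what reuse buys: Go's own http.Transport does keep-alive by default, and counting TCP connections against a throwaway local server makes the difference visible. This sketch is unrelated to gowarc's internals; it just demonstrates the behavior we want:

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// countConns performs two sequential GETs against a throwaway server and
// returns how many TCP connections the client actually opened.
func countConns(disableKeepAlives bool) int64 {
	var conns atomic.Int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	}))
	srv.Config.ConnState = func(c net.Conn, s http.ConnState) {
		if s == http.StateNew {
			conns.Add(1)
		}
	}
	srv.Start()
	defer srv.Close()

	client := &http.Client{Transport: &http.Transport{DisableKeepAlives: disableKeepAlives}}
	for i := 0; i < 2; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection can return to the pool
		resp.Body.Close()
	}
	client.CloseIdleConnections()
	return conns.Load()
}

func main() {
	fmt.Println("with keep-alive:", countConns(false)) // 1 connection for both requests
	fmt.Println("without:", countConns(true))          // 1 connection per request
}
```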
HTTP caching
Chromium’s HTTP caching (RFC 9111) lives in the net stack.
We bypass Chromium’s net stack via CDP to use our own gowarc HTTP client.
As a result, every tab re-downloads cacheable resources (JS, CSS, images…). Implementing a full HTTP cache in Zeno would be a lot of work just to save bandwidth.
Fortunately, CDP exposes the resource type in request metadata. If we’ve seen a URL before and it’s an image, CSS, font, etc., we can block it. For cacheable static HTML/JS, we have to let them through for the page to load correctly.
isSeen := seencheck(item, seed, hijack.Request.URL().String())
if isSeen {
	resType := hijack.Request.Type()
	switch resType {
	case proto.NetworkResourceTypeImage, proto.NetworkResourceTypeMedia, proto.NetworkResourceTypeFont, proto.NetworkResourceTypeStylesheet:
		logger.Debug("request has been seen before and is a discardable resource. Skipping it", "type", resType)
		hijack.Response.Fail(proto.NetworkErrorReasonBlockedByClient)
		return
	default:
		logger.Debug("request has been seen before, but is not a discardable resource. Continuing with the request", "type", resType)
	}
}
This isn’t “standard” like a real HTTP cache, and headful previews will look incomplete, but it’s simple, effective, and doesn’t hurt replay quality. Based on HTTP Archive’s Page Weight stats, a back-of-the-envelope estimate suggests it saves about 44% of traffic.
Future work:
Compared to other asset types, JavaScript payloads have been trending upward.
We could implement an HTTP cache for headless in Zeno to save JS bandwidth.
Chromium revision
Rod’s default Chromium revision is too old. I switched to fetching the latest snapshot revision from chromiumdash.appspot.com and downloading the matching binary.
Sometimes Google publishes a version number on chromiumdash but doesn’t build a binary for it. So, like Rod, we pinned a default revision in Zeno.
I also found some WAFs block snapshot Chromium builds but ignore distro-packaged release builds. For production, it’s best to use the distro’s Chromium.
Q: Why not use Google Chrome binaries?
A: Zeno is open source, and Google Chrome is not. It’s a mismatch.
Q: Doesn’t Google ship release builds of Chromium?
A: Only automated snapshot builds.
Content-Length and HTTP connections
When Zeno starts a crawl and free disk space drops too low, the diskWatcher signals workers to pause new items, but in-flight downloads continue until they finish.
If, at the moment of the pause, the total size of in-flight downloads exceeds --min-space-required, we’ll still run out of space — kaboom.
So I added --max-content-length. Before downloading, we check the Content-Length header; if it exceeds the cap, we skip. For streaming responses with unknown length, if the downloaded size crosses the cap, we abort.
While working on this, I found three connection-related bugs in gowarc:
A critical one: when the HTTP TCP connection closes abnormally (early EOF, I/O timeout, conn closed/reset), gowarc called .Close() instead of .CloseWithError(). The downstream mirrored MITM TCP connection mistook it for a normal EOF. For streaming responses without Content-Length, these early-EOF responses were treated as valid and written into WARC, compromising data integrity for all streaming responses. (For non-streaming responses with Content-Length, Go’s http.ReadResponse() uses io.LimitReader and checks that EOF aligns with Content-Length; if mismatched, it returns an early EOF. In other words, the stdlib masked this in most cases.)
--http-read-deadline had no effect.
On errors, temp files were sometimes not deleted.
Since Zeno defaults to Accept-Encoding: gzip, many servers return gzipped HTML/CSS/JS, often as streaming payloads. The impact was broad.
I also planned a --min-space-urgent option to abort all in-flight downloads when disk space is critically low. I got too excited fixing the bugs and forgot. Next time.
E2E tests
Web crawlers face the real internet. Zeno used to have only unit tests, with no integration or E2E tests exercising end-to-end behavior.
I’d heard Kubernetes’ E2E tests are the gold standard in Go land, but staring at https://github.com/kubernetes-sigs/e2e-framework didn’t help — I still didn’t know how to wire E2E for Zeno. How to instrument it? How to assert components behave as expected?
I came up with a log-based E2E approach; I couldn't think of a less intrusive way.
During go test, spin up Zeno’s main in-process, redirect logs to a socket, and let Zeno run the test workload we feed it. The test suite connects to the socket, streams logs, and asserts lines we expect (or don’t expect).
This requires no instrumentation or mocking — logs are the probe.
To get coverage and race detector benefits over the code under test, we don’t exec a separate binary; we invoke the main entry point from a Test* function. Since go test builds all tests in the same package into one binary and runs them in one process, each E2E test must live in a different package.
Later I realized we were in one process, so we didn’t need sockets for “cross-process” comms; I switched to a simpler io.Pipe.
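Stripped to its essence, the pattern looks like this (names are illustrative and the log line is made up; Zeno's harness is more involved):

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"log/slog"
	"strings"
)

// runMain stands in for invoking Zeno's real entry point in-process.
func runMain(logOut io.Writer) {
	logger := slog.New(slog.NewTextHandler(logOut, nil))
	logger.Info("seed archived", "url", "https://example.com")
}

// expectLog runs the program and reports whether a log line containing
// substr was emitted: the whole E2E assertion mechanism in miniature.
func expectLog(substr string) bool {
	r, w := io.Pipe() // in-process, so no socket needed
	go func() {
		runMain(w)
		w.Close()
	}()
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if strings.Contains(sc.Text(), substr) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(expectLog("seed archived"))
}
```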
Future work:
Increase coverage. Since adding E2E tests, Zeno’s coverage slowly climbed from 51% to 56%. While 100% isn’t realistic for a crawler, getting to ~70% would significantly boost confidence when making changes.
Pulling your head out of the UTF‑8 sand
It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII.
If Zeno were a general-purpose crawler, we could ignore the mere 1.2% of pages that still use non-UTF-8 encodings. But as a web archiver, those legacy-encoded sites that survived into the present are valuable and charmingly retro.
Implementation was straightforward: follow WHATWG specs step by step and add tests.
The specs smell like legacy, too:
https://html.spec.whatwg.org/multipage/urls-and-fetching.html#resolving-urls
Let encoding be UTF-8.
If environment is a Document object, then set [encoding] (document charset) to environment's character encoding.
https://url.spec.whatwg.org/#query-state
1. ....
2. Percent-encode after encoding, with [encoding], buffer, and queryPercentEncodeSet, and append the result to url’s query.
https://url.spec.whatwg.org/#path-state
1. If ...special cases...
2. Otherwise, run these steps:
.... special cases ...
3. UTF-8 percent-encode c using the path percent-encode set and append the result to buffer.
If the document charset is non‑UTF‑8, then when encoding a URL: the path is UTF‑8 percent-encoded, but the query is percent-encoded using the document’s charset. Quirky? Yep.
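For example, take a GBK-encoded document linking to /中?q=中. A sketch of the resulting split encoding (the GBK bytes are hardcoded since Go's stdlib has no GBK encoder; 中 is 0xD6 0xD0 in GBK and 0xE4 0xB8 0xAD in UTF-8, and the encoder below simplifies by escaping every byte):

```go
package main

import "fmt"

// percentEncode encodes every byte as %XX. This is a simplification:
// the real percent-encode sets leave many ASCII bytes bare.
func percentEncode(b []byte) string {
	out := ""
	for _, c := range b {
		out += fmt.Sprintf("%%%02X", c)
	}
	return out
}

func main() {
	path := percentEncode([]byte("中"))        // path: always UTF-8
	query := percentEncode([]byte{0xD6, 0xD0}) // query: document charset (GBK here)
	fmt.Printf("/%s?q=%s\n", path, query)      // /%E4%B8%AD?q=%D6%D0
}
```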
ada is a C++ URL parser compatible with WHATWG standards. It has Go bindings: goada. Zeno uses it to normalize URLs (Go’s stdlib net/url isn’t WHATWG-compatible). But since goada is C++, building Zeno requires CGO, which complicates cross-compilation.
@otkd tried replacing goada with pure Go nlnwa/whatwg-url (Zeno#374), but I found that on non‑UTF‑8 input it bluntly replaces bytes with U+FFFD before percent-encoding, instead of percent-encoding the raw bytes.
Before normalization we can’t assume input URLs are valid UTF‑8, and for non‑UTF‑8 HTML/URLs we need the parser to percent-encode bytes as-is (as WHATWG requires), so #374 was closed.
Fun fact: goada's C bindings were updated to the latest ada release as part of our exploration into using a different URL parser.
Future work:
goada is of excellent quality; if only it didn't require CGO. We could try packaging ada as WASM and running it with wazero (as ncruces/go-sqlite3 does) to avoid CGO.
Misc
Lots of small PRs not worth detailing: terminal colors 🌈, sending SIGNALs to Zeno from HQ (tracker) over WebSocket, improving archiving of GitHub Issues and PR pages, etc.
What I didn’t ship (per the original proposal)
Dummy test site. When I wrote the proposal, I hadn’t figured out a good E2E approach, so I planned a small httpbin-like site to help future E2E tests. After inventing the log-based method, this became unnecessary — the test code can spin up whatever web server it needs.
Route items between a headless project HQ and a general project HQ conditionally.
Acknowledgments
Google: for GSoC — it’s a great program.
Internet Archive: thanks in many senses — Universal Access to All Knowledge!
Dr. Sawood Alam, Will Howes, Jake LaFountain: my GSoC mentors — for reviewing my PRs and sharing many useful ideas. I learned lots of nifty web tricks.
Corentin: the author of Zeno — no Zeno without him.
Members of STWP:
@OverflowCat: pointed me to other potential Go CSS parsers and fixed VS Code’s CSS variable highlighting; their blog “A New World’s Door” is full of high-tech CSS and became my testbed for Zeno’s CSS features.
@renbaoshuo: the CSS lexer port is great.
@NyaMisty: encouraged me a year ago to learn Go — opened a new world.
@Ovler: revised my GSoC proposal PDF; helped uncover the conntrack issue.
rod, goada, browsertrix-behaviors, and other libraries we depend on.
Ladybird: not yet a usable browser, but the repo is small and easy to clone. Its code serves as a reference implementation for web standards. When the spec text is confusing, reading Ladybird helps — even though I barely know C++.