Sync
CodeX edited this page 2026-04-07 01:09:01 +02:00

Overview

The sync process fetches the upstream Cleanuparr blacklist, preserves any manual local additions, subtracts the locally-maintained whitelist, and writes the result back to blacklist. It runs on a schedule (every 7 days) and on manual dispatch. All logic lives in scripts/merge_blocklists.py, about 45 lines of Python that use only the standard library (no third-party dependencies).

Inputs

The script reads four sources on every run:

| Source | Path | Role |
|---|---|---|
| Upstream | https://raw.githubusercontent.com/Cleanuparr/Cleanuparr/main/blacklist | Current upstream state, fetched over HTTPS |
| Upstream snapshot | blacklist.prev | What upstream looked like on the previous sync (baseline) |
| Committed blacklist | blacklist | Current committed state, may contain manual local additions |
| Whitelist | whitelist | Locally-maintained entries to strip from the merged result |

All four are parsed the same way: one entry per non-empty line, stripped of leading/trailing whitespace, loaded into a Python set.
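As a small illustration of the parsing rule above, the sketch below applies it to an in-memory string instead of a file. The helper name parse_entries and the sample entries are hypothetical and exist only for this example.

```python
def parse_entries(text):
    """Return the set of non-empty, whitespace-stripped lines in text."""
    return {line.strip() for line in text.splitlines() if line.strip()}


# Blank lines are skipped, padding is stripped, duplicates collapse.
sample = "*.exe\n\n  *.lnk  \n*.exe\n"
entries = parse_entries(sample)
print(sorted(entries))
```

Because a set has no ordering, the original line order is not kept, which is presumably why the script sorts entries before writing them back.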

Three-way merge

The script performs a classic three-way merge, git-style, using set operations:

custom  = local - upstream_prev
merged  = upstream_new | custom
result  = merged - whitelist

Each line does one specific job:

custom = local - upstream_prev

Compute what was added locally. Anything in the committed blacklist that was not in the previous upstream snapshot must be a manual local addition, because the sync script is the only other thing that writes to blacklist and it always produces a subset of upstream_new | custom. Tracking this set lets the next sync re-apply those additions on top of the new upstream.

merged = upstream_new | custom

Union the fresh upstream with the preserved local additions. Upstream additions flow in (they appear in upstream_new), upstream removals flow out (they were in upstream_prev but are not in upstream_new, and are also not in custom), and manual local additions survive.

result = merged - whitelist

Strip every entry that appears in the locally-maintained whitelist. This is the step that enables local removals: an extension placed in whitelist is always removed from the final blacklist, no matter how many times upstream re-adds it.

After the merge the script writes result to blacklist and overwrites blacklist.prev with upstream_new so the next run has a fresh baseline.
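A worked example of the three set operations may help. The entries below are invented for illustration and are not taken from the real upstream list.

```python
# Hypothetical state before a sync run.
upstream_prev = {"*.exe", "*.lnk", "*.scr"}           # snapshot from the last run
upstream_new = {"*.exe", "*.lnk", "*.srt", "*.zipx"}  # upstream dropped *.scr, added *.srt and *.zipx
local = {"*.exe", "*.lnk", "*.scr", "*.junk"}         # committed file; *.junk was added by hand
whitelist = {"*.srt"}                                 # never block subtitles

custom = local - upstream_prev   # {"*.junk"}: the only entry upstream never had
merged = upstream_new | custom   # fresh upstream plus the manual addition
result = merged - whitelist      # *.srt is stripped again

print(sorted(result))
```

In this example *.scr leaves the list because upstream removed it, *.zipx arrives from upstream, *.junk survives as a manual addition, and *.srt is kept out by the whitelist.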

Why a three-way merge

A simpler design would be result = upstream_new - whitelist, with no .prev file and no custom tracking. That works for the common case but drops an escape hatch: if you spot something upstream missed (a new malware extension, a tracker-specific junk file) and add it directly to blacklist, the next sync would silently drop it.

The three-way merge preserves those manual additions without requiring them to live in a separate "additions" file. If you never add anything directly, the custom set is empty on every run and the merge reduces to upstream_new - whitelist. The overhead is one extra file (blacklist.prev) and two set operations.
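The difference between the two designs can be sketched side by side. The function names and entries here are hypothetical. Only the set expressions come from the script.

```python
def naive_sync(upstream_new, whitelist):
    """The simpler design: no baseline, no custom tracking."""
    return upstream_new - whitelist


def three_way_sync(upstream_new, upstream_prev, local, whitelist):
    """The design used here: keep whatever was added locally."""
    custom = local - upstream_prev
    return (upstream_new | custom) - whitelist


upstream_prev = {"*.exe", "*.lnk"}
upstream_new = {"*.exe", "*.lnk", "*.scr"}
local = {"*.exe", "*.lnk", "*.malware"}  # *.malware was added by hand
whitelist = set()

naive = naive_sync(upstream_new, whitelist)
three_way = three_way_sync(upstream_new, upstream_prev, local, whitelist)

print("naive keeps *.malware:", "*.malware" in naive)
print("three-way keeps *.malware:", "*.malware" in three_way)

# With no manual additions the two designs agree.
untouched = three_way_sync(upstream_new, upstream_prev, upstream_prev, whitelist)
```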

Whitelist exclusion

The whitelist is subtracted with exact-string set subtraction, not pattern matching. This has two important consequences:

Exact entries are stripped

*.srt in whitelist strips exactly *.srt from the blacklist. Same for *.webm, *.mkv, etc.

Sample patterns are preserved

The upstream blacklist contains entries like *sample.srt, *sample.webm, and *sample.mkv that block sample files of a given type (names ending in sample.srt, sample.webm, and so on). These are separate string entries from *.srt or *.webm, so whitelisting the plain extension does not remove the sample-file variant. Sample files continue to be blocked.

This is almost always the behaviour you want: subtitle files shipped inside a release are kept, but standalone "sample.srt" clutter is still filtered.
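The sketch below contrasts exact-string subtraction with glob-style matching. The fnmatch variant is shown only as a counterexample of what the script does not do, and the entries are illustrative.

```python
import fnmatch

blacklist = {"*.srt", "*sample.srt", "*.webm", "*sample.webm", "*.exe"}
whitelist = {"*.srt", "*.webm"}

# What the script does: plain set subtraction on exact strings.
exact = blacklist - whitelist

# What it does NOT do: treat whitelist entries as glob patterns.
# Here "*.srt" would also match the string "*sample.srt".
pattern_based = {
    entry
    for entry in blacklist
    if not any(fnmatch.fnmatchcase(entry, pattern) for pattern in whitelist)
}

print(sorted(exact))
print(sorted(pattern_based))
```

With exact subtraction the sample entries survive. Under pattern matching they would be removed along with the plain extensions.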

The .prev file

blacklist.prev is a plain text snapshot of whatever upstream_new was on the previous successful run. It has no special format, no metadata, and is never edited manually. The sync script rewrites it at the end of every run.

It exists purely as the baseline for the local - upstream_prev step in the three-way merge. Without it, the script could not distinguish "this entry was in local because upstream had it" from "this entry was in local because someone added it manually."

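If blacklist.prev is missing or empty (first run, or manually deleted), the script treats the current upstream_new as the baseline, so custom = local - upstream_new. Manual additions already present in blacklist are therefore kept. The side effect is that any entry upstream has removed since blacklist was last written also lands in custom, and it stays in the blacklist until you delete it by hand or add it to whitelist. After a run without a .prev file, check the "Custom preserved" log line for stale entries.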

Edge cases

First run

blacklist.prev does not exist; blacklist may or may not. The script falls back to upstream_prev = upstream_new, so custom = local - upstream_new (anything in local that is not in the current upstream). After the run, .prev exists and subsequent runs use the normal path.
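The first-run fallback can be sketched as follows. The merge helper is a hypothetical wrapper around the script's set logic, and the entries are invented.

```python
def merge(upstream_new, upstream_prev, local, whitelist):
    """Return (result, custom), falling back to upstream_new as the baseline."""
    if not upstream_prev:  # no snapshot yet
        upstream_prev = set(upstream_new)
    custom = local - upstream_prev
    return (upstream_new | custom) - whitelist, custom


upstream_new = {"*.exe", "*.lnk"}
# A pre-existing blacklist: *.mine was added by hand, while *.old is an
# entry upstream once shipped and has since removed.
local = {"*.exe", "*.old", "*.mine"}

result, custom = merge(upstream_new, set(), local, set())
print(sorted(custom))
print(sorted(result))
```

Without a snapshot the script cannot tell *.old and *.mine apart, so both are treated as local additions and kept.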

Empty or missing whitelist

If whitelist is missing or empty, whitelist = set() and the subtraction is a no-op. The merge degenerates to a plain upstream sync with local additions preserved.

Empty or missing blacklist

If blacklist is missing or empty, local = set(), custom = set(), and result = upstream_new - whitelist. Equivalent to a fresh install.

Upstream removes an entry that is also in the whitelist

Nothing breaks. upstream_new does not contain the entry, so merged does not contain it either, and the whitelist subtraction removes nothing (the entry was already absent). The whitelist entry stays in place as a no-op for future syncs.

An entry appears in both whitelist and blacklist custom additions

You manually added *.foo to blacklist and also added *.foo to whitelist. The whitelist wins: *.foo is in custom, survives the union, then gets stripped by the final subtraction. The committed blacklist will not contain *.foo. Because custom is recomputed from the committed blacklist on every run, the manual addition is gone from that point on: removing *.foo from whitelist later does not restore it, and you have to add it to blacklist again.
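The remaining edge cases can be checked with a few lines of set arithmetic. The merge helper and all entries below are hypothetical. Only the three set expressions come from the script.

```python
def merge(upstream_new, upstream_prev, local, whitelist):
    """Set logic of the three-way merge; returns (result, custom)."""
    custom = local - upstream_prev
    return (upstream_new | custom) - whitelist, custom


upstream = {"*.exe", "*.lnk"}

# Empty or missing whitelist: the subtraction is a no-op.
no_whitelist, _ = merge(upstream, upstream, upstream | {"*.junk"}, set())

# Empty or missing blacklist: behaves like a fresh install.
fresh, fresh_custom = merge(upstream, upstream, set(), {"*.lnk"})

# Upstream removes an entry (*.srt) that is also whitelisted.
removed, _ = merge(upstream, upstream | {"*.srt"}, set(upstream), {"*.srt"})

# The same entry is a manual addition and a whitelist entry.
both, both_custom = merge(upstream, upstream, upstream | {"*.foo"}, {"*.foo"})

print(sorted(no_whitelist), sorted(fresh), sorted(removed), sorted(both))
```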

Reporting

Each sync run logs four lines to the workflow output:

[blacklist] Upstream added:     [...]
[blacklist] Upstream removed:   [...]
[blacklist] Custom preserved:   [...]
[blacklist] Whitelist stripped: [...]

These are sorted lists showing exactly what changed. Check the Actions run log after any sync to see what happened, especially if a consumer reports unexpected behaviour.
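The logging code is not reproduced on this page, so the following is only a guess at how the four lists could be derived from the sets the script already computes. The function name build_report and the exact definition of each list are assumptions, not the repository's implementation.

```python
def build_report(upstream_new, upstream_prev, local, whitelist):
    """Return four log lines in the format shown above (assumed semantics)."""
    custom = local - upstream_prev
    merged = upstream_new | custom
    sections = [
        ("Upstream added", upstream_new - upstream_prev),
        ("Upstream removed", upstream_prev - upstream_new),
        ("Custom preserved", custom - whitelist),   # assumption: only survivors
        ("Whitelist stripped", merged & whitelist),  # assumption: entries actually removed
    ]
    lines = []
    for label, entries in sections:
        heading = label + ":"
        # Pad the heading to 20 characters so the lists line up.
        lines.append(f"[blacklist] {heading:<20}{sorted(entries)}")
    return lines


report = build_report(
    upstream_new={"*.exe", "*.srt", "*.zipx"},
    upstream_prev={"*.exe", "*.scr"},
    local={"*.exe", "*.scr", "*.junk"},
    whitelist={"*.srt"},
)
print("\n".join(report))
```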

Full script

import urllib.request

UPSTREAM_URL = "https://raw.githubusercontent.com/Cleanuparr/Cleanuparr/main/blacklist"
BLACKLIST = "blacklist"
BLACKLIST_PREV = "blacklist.prev"
WHITELIST = "whitelist"


def read_lines(path):
    try:
        with open(path) as f:
            return set(line.strip() for line in f if line.strip())
    except FileNotFoundError:
        return set()


def main():
    with urllib.request.urlopen(UPSTREAM_URL) as r:
        upstream_new = set(
            line.strip() for line in r.read().decode().splitlines() if line.strip()
        )

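    # Baseline: what upstream looked like on the previous sync. If the
    # snapshot is missing or empty, fall back to the current upstream.
    upstream_prev = read_lines(BLACKLIST_PREV)
    if not upstream_prev:
        upstream_prev = upstream_new.copy()

    local = read_lines(BLACKLIST)
    whitelist = read_lines(WHITELIST)

    custom = local - upstream_prev   # manual local additions
    merged = upstream_new | custom   # fresh upstream plus preserved additions
    result = merged - whitelist      # whitelist always wins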

    with open(BLACKLIST, "w") as f:
        f.write("\n".join(sorted(result)) + "\n")

    with open(BLACKLIST_PREV, "w") as f:
        f.write("\n".join(sorted(upstream_new)) + "\n")

Logging and the __main__ guard are omitted above for clarity. See scripts/merge_blocklists.py in the repository for the full source.