Initial wiki: sync, lists, consumers, workflow
|
||||
# CI and Workflow
|
||||
|
||||
## Overview
|
||||
|
||||
Synchronisation is driven by a single Gitea Actions workflow,
|
||||
`.gitea/workflows/sync.yml`. It runs on a schedule and on manual dispatch.
|
||||
Each run executes the sync script, commits any changes to `blacklist` or
|
||||
`blacklist.prev`, and pushes the commit to `main`.
|
||||
|
||||
There is no CI in the traditional sense for this repository -- no tests,
|
||||
no build, no lint. The workflow's only job is to keep the blacklist in
|
||||
sync with upstream.
|
||||
|
||||
## Workflow file
|
||||
|
||||
```yaml
name: Sync blocklists from upstream

on:
  schedule:
    - cron: '0 4 */7 * *'
  workflow_dispatch:

jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Fetch and merge upstream files
        run: python3 scripts/merge_blocklists.py

      - name: Commit and push if changed
        run: |
          git config user.name "gitea-actions"
          git config user.email "actions@gitea"
          git add .
          git diff --staged --quiet || git commit -m "Sync blocklists from upstream"
          git push
```
|
||||
|
||||
## Schedule
|
||||
|
||||
The cron expression `0 4 */7 * *` runs at 04:00 UTC on days 1, 8, 15, 22,
and 29 of each month. That is every 7 days within a month, with a shorter
gap at the month boundary: day 29 and day 1 of the next month are only
1-3 days apart. In a 28-day February there is no day 29, so the gap from
day 22 to 1 March is a full 7 days.
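
To see the uneven spacing concretely, the sketch below lists the days matched by a `*/7` day-of-month field and the gap to the first run of the following month. The helper names `run_days` and `gap_to_next_month` are illustrative and not part of the repository, and the sketch assumes standard cron semantics, where the step restarts from day 1 every month.

```python
import calendar


def run_days(year, month, step=7):
    """Days of the month matched by a day-of-month field of */step."""
    last_day = calendar.monthrange(year, month)[1]
    return list(range(1, last_day + 1, step))


def gap_to_next_month(year, month, step=7):
    """Days between the last run of this month and day 1 of the next."""
    last_day = calendar.monthrange(year, month)[1]
    return last_day - run_days(year, month, step)[-1] + 1


# January, leap February, plain February, April
for year, month in [(2024, 1), (2024, 2), (2023, 2), (2024, 4)]:
    print(year, month, run_days(year, month), gap_to_next_month(year, month))
```

The gap is 3 days after a 31-day month, 2 after a 30-day month, 1 after a 29-day February, and 7 after a 28-day February.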
|
||||
|
||||
This cadence is deliberate: upstream Cleanuparr rarely changes the
|
||||
blacklist, and running less frequently reduces noise in the commit
|
||||
history. If upstream is updated and you want the change immediately,
|
||||
use manual dispatch (see below) instead of waiting for the next scheduled
|
||||
run.
|
||||
|
||||
### Changing the schedule
|
||||
|
||||
Edit the `cron` line in `.gitea/workflows/sync.yml`. Common alternatives:
|
||||
|
||||
| Cron expression | Meaning |
|---|---|
| `0 4 */7 * *` | Days 1, 8, 15, 22, and 29 of each month at 04:00 UTC (current) |
| `0 4 * * 1` | Every Monday at 04:00 UTC |
| `0 4 1 * *` | First day of every month at 04:00 UTC |
| `0 */6 * * *` | Every 6 hours, on the hour |
|
||||
|
||||
All times are UTC. Gitea Actions does not support timezones in cron
|
||||
expressions.
|
||||
|
||||
## Manual dispatch
|
||||
|
||||
The `workflow_dispatch` trigger lets you run the sync on demand from
|
||||
the Gitea UI or via the API. Use this after editing `whitelist` if you
|
||||
want the change to take effect immediately instead of waiting for the
|
||||
next scheduled run.
|
||||
|
||||
### From the Gitea UI
|
||||
|
||||
1. Open the repository on Gitea.
|
||||
2. Go to **Actions** -> **Sync blocklists from upstream**.
|
||||
3. Click **Run workflow**.
|
||||
4. Select branch `main`.
|
||||
5. Click the confirm button.
|
||||
|
||||
The run appears in the Actions list within a few seconds and typically
|
||||
completes in under a minute.
|
||||
|
||||
### From the API
|
||||
|
||||
```bash
curl -X POST \
  -H "Authorization: token YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"ref": "main"}' \
  https://git.hisp.no/api/v1/repos/arr/blocklists/actions/workflows/sync.yml/dispatches
```
|
||||
|
||||
The token needs `write:repository` scope for the `arr/blocklists` repo.
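
The same dispatch call can be made from Python with only the standard library. This is a minimal sketch rather than a supported client: the endpoint is copied from the curl example, while `build_dispatch_request`, `dispatch`, and the 30-second timeout are illustrative choices.

```python
import json
import urllib.request

# Endpoint copied from the curl example.
DISPATCH_URL = (
    "https://git.hisp.no/api/v1/repos/arr/blocklists"
    "/actions/workflows/sync.yml/dispatches"
)


def build_dispatch_request(token, ref="main"):
    """Build (but do not send) the POST request that dispatches the workflow."""
    body = json.dumps({"ref": ref}).encode("utf-8")
    return urllib.request.Request(
        DISPATCH_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )


def dispatch(token, ref="main", timeout=30):
    """Send the request; urlopen raises HTTPError on a 4xx/5xx response."""
    request = build_dispatch_request(token, ref)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Splitting request construction from sending keeps the network call in one place, so the request can be inspected without touching the server.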
|
||||
|
||||
## What the workflow does
|
||||
|
||||
### Step 1: checkout
|
||||
|
||||
Standard `actions/checkout@v3`. Checks out the repository at the current
|
||||
HEAD of `main`. No submodules, no LFS, no special configuration.
|
||||
|
||||
### Step 2: fetch and merge
|
||||
|
||||
Runs `python3 scripts/merge_blocklists.py`. The script:
|
||||
|
||||
1. Fetches the upstream blacklist from
|
||||
`https://raw.githubusercontent.com/Cleanuparr/Cleanuparr/main/blacklist`.
|
||||
2. Reads `blacklist.prev`, `blacklist`, and `whitelist` from the checked-out
|
||||
repository.
|
||||
3. Performs the three-way merge and whitelist subtraction.
|
||||
4. Writes `blacklist` and `blacklist.prev` back to disk.
|
||||
|
||||
The script is idempotent: running it twice in a row with no upstream or
|
||||
whitelist changes produces no diff on the second run.
|
||||
|
||||
See [Sync](Sync) for the full algorithm.
|
||||
|
||||
### Step 3: commit and push if changed
|
||||
|
||||
```bash
git config user.name "gitea-actions"
git config user.email "actions@gitea"
git add .
git diff --staged --quiet || git commit -m "Sync blocklists from upstream"
git push
```
|
||||
|
||||
This commits only if the script actually changed something.
`git diff --staged --quiet` exits non-zero when there are staged changes,
which triggers the commit via `||`. If nothing changed, `git commit` is
skipped and the final `git push`, which always runs, is a no-op (there are
no local commits ahead of the remote).
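
For readers more comfortable with Python than shell, the control flow of this step can be sketched as follows. The workflow itself uses the shell commands shown in the workflow file; `commit_if_changed` and its injectable `run` parameter are hypothetical, the latter added only so the logic can be exercised without a real repository.

```python
import subprocess


def commit_if_changed(message="Sync blocklists from upstream", run=subprocess.run):
    """Stage everything, commit only when something is staged, then push.

    Returns True when a commit was created.
    """
    run(["git", "add", "."], check=True)
    # `git diff --staged --quiet` exits 1 when there are staged changes, 0 otherwise.
    staged = run(["git", "diff", "--staged", "--quiet"])
    committed = staged.returncode != 0
    if committed:
        run(["git", "commit", "-m", message], check=True)
    # The push runs either way; with nothing new to send it is a no-op.
    run(["git", "push"], check=True)
    return committed
```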
|
||||
|
||||
The commit author is always `gitea-actions <actions@gitea>`, regardless
|
||||
of who triggered the run. This makes automated syncs easy to distinguish
|
||||
from human commits in the history.
|
||||
|
||||
## Permissions
|
||||
|
||||
The workflow runs with the default `GITHUB_TOKEN` (Gitea equivalent) that
|
||||
Gitea Actions provides automatically. This token has write access to the
|
||||
repository, which is necessary for the commit-and-push step. No additional
|
||||
secrets are required.
|
||||
|
||||
No external API tokens are needed -- the upstream blacklist is fetched
|
||||
from a public raw URL on `raw.githubusercontent.com` without
|
||||
authentication.
|
||||
|
||||
## Monitoring
|
||||
|
||||
### Checking recent runs
|
||||
|
||||
Go to **Actions** -> **Sync blocklists from upstream** in the Gitea UI.
|
||||
Each run shows:
|
||||
|
||||
- Status (success / failure)
|
||||
- Trigger (schedule / manual dispatch)
|
||||
- Commit created (if any)
|
||||
- Full log output
|
||||
|
||||
### Reading the log
|
||||
|
||||
The Python script prints four summary lines per run. These appear in
|
||||
the "Fetch and merge upstream files" step log:
|
||||
|
||||
```
|
||||
[blacklist] Upstream added: [...]
|
||||
[blacklist] Upstream removed: [...]
|
||||
[blacklist] Custom preserved: [...]
|
||||
[blacklist] Whitelist stripped: [...]
|
||||
```
|
||||
|
||||
Use these to verify the sync behaved as expected. "Whitelist stripped"
|
||||
should list every entry in your whitelist that was present in the upstream
|
||||
blacklist at fetch time.
|
||||
|
||||
### Run history in git log
|
||||
|
||||
Every automated commit uses the same author and message, so filtering the
history is easy:
|
||||
|
||||
```bash
|
||||
git log --author="gitea-actions" --oneline
|
||||
```
|
||||
|
||||
Or to see commits that actually touched the blacklist:
|
||||
|
||||
```bash
|
||||
git log --oneline -- blacklist
|
||||
```
|
||||
|
||||
## Failure modes
|
||||
|
||||
### Upstream unreachable
|
||||
|
||||
If `raw.githubusercontent.com` is unreachable or returns an HTTP error
status, `urllib.request.urlopen` raises an exception and the script
|
||||
exits non-zero. The workflow fails at the "Fetch and merge upstream
|
||||
files" step. No commit is made, no push happens. The repository state
|
||||
is unchanged.
|
||||
|
||||
Retry the workflow manually once upstream is available again.
|
||||
|
||||
### Script error
|
||||
|
||||
If the sync script crashes (malformed upstream, disk full, etc.), the
|
||||
step fails and no commit is made. Read the full step log to diagnose.
|
||||
|
||||
### Push rejected
|
||||
|
||||
If someone pushes to `main` between the checkout and the push, the push
|
||||
is rejected (non-fast-forward). The workflow fails at the push step.
|
||||
No data is lost -- the next scheduled run will fetch the latest state
|
||||
and re-apply the sync.
|
||||
|
||||
### Commit is empty
|
||||
|
||||
This is not a failure. The `git diff --staged --quiet || git commit`
|
||||
pattern explicitly skips the commit when nothing changed, and the
|
||||
subsequent `git push` is a no-op. The workflow reports success.
|
||||
|
||||
## Disabling the scheduled run
|
||||
|
||||
To pause automatic syncing without removing the workflow entirely,
|
||||
comment out the `schedule` section in `.gitea/workflows/sync.yml`:
|
||||
|
||||
```yaml
on:
  # schedule:
  #   - cron: '0 4 */7 * *'
  workflow_dispatch:
```
|
||||
|
||||
Manual dispatch still works. Uncomment to re-enable scheduling.
|
||||
|
||||
# Consumers
|
||||
|
||||
The blocklists are consumed by two tools in the ARR stack:
|
||||
|
||||
| Tool | Role | File consumed | Mode |
|---|---|---|---|
| qBittorrent | Download client | `blacklist` | Excluded file names |
| Cleanuparr | Media cleanup / malware blocker | `blacklist` or `whitelist` | Blacklist or whitelist mode |
|
||||
|
||||
Both tools read a remote text file over HTTPS, one glob pattern per line.
|
||||
They refresh on their own schedule (qBittorrent on restart or manual
|
||||
refresh; Cleanuparr on its configured interval).
|
||||
|
||||
## Raw URLs
|
||||
|
||||
Point consumers at the raw file URLs, not the Gitea blob viewer URLs:
|
||||
|
||||
```
|
||||
https://git.hisp.no/arr/blocklists/raw/branch/main/blacklist
|
||||
https://git.hisp.no/arr/blocklists/raw/branch/main/whitelist
|
||||
```
|
||||
|
||||
The `raw/branch/main/` path serves the file contents directly with the
|
||||
correct `text/plain` content type. Using `src/branch/main/` instead serves
|
||||
the HTML viewer page and will break the consumer.
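
If a viewer URL has already been pasted into a consumer, the rewrite is mechanical. The helper below is a small sketch; `to_raw_url` is an illustrative name, and it assumes the URL follows the `src/branch/` and `raw/branch/` path layout shown in this section.

```python
from urllib.parse import urlsplit, urlunsplit


def to_raw_url(url):
    """Rewrite a Gitea `src/branch/...` viewer URL to its `raw/branch/...` form."""
    parts = urlsplit(url)
    path = parts.path.replace("/src/branch/", "/raw/branch/", 1)
    return urlunsplit(parts._replace(path=path))


print(to_raw_url("https://git.hisp.no/arr/blocklists/src/branch/main/blacklist"))
```

URLs that already use `raw/branch/` pass through unchanged.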
|
||||
|
||||
## qBittorrent
|
||||
|
||||
qBittorrent has an **excluded file names** feature that skips files
|
||||
matching any of the configured glob patterns when downloading a torrent.
|
||||
There is no "included file names" or whitelist mode -- qBittorrent only
|
||||
supports exclusion. This is why it consumes the merged `blacklist` and not
|
||||
the `whitelist`.
|
||||
|
||||
### Configuration
|
||||
|
||||
1. Open **Options** (Tools -> Options, or Ctrl+,).
|
||||
2. Go to **Downloads**.
|
||||
3. Scroll to **Excluded file names**.
|
||||
4. Enable the checkbox.
|
||||
5. Set the URL to:
|
||||
|
||||
```
|
||||
https://git.hisp.no/arr/blocklists/raw/branch/main/blacklist
|
||||
```
|
||||
|
||||
qBittorrent fetches the list on startup and whenever you click **Reload**
|
||||
next to the field. There is no automatic refresh interval -- a restart or
|
||||
manual reload is required to pick up changes.
|
||||
|
||||
### What qBittorrent does with the list
|
||||
|
||||
When a torrent is added, qBittorrent iterates the files inside it and
|
||||
checks each filename against the excluded patterns. Matching files are
|
||||
marked as "do not download" and will not be written to disk. The rest of
|
||||
the torrent downloads normally.
|
||||
|
||||
This means the list operates at the **file level within a torrent**, not
|
||||
the torrent level. A torrent containing `movie.mkv` and `movie.nor.srt`
|
||||
would download both files if `*.srt` is in the whitelist (and thus not in
|
||||
the blacklist), or just `movie.mkv` if `*.srt` were in the blacklist.
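
The file-level behaviour can be approximated in a few lines. This sketch assumes the excluded patterns behave like shell-style globs as implemented by Python's `fnmatch`; qBittorrent's actual matcher may differ in details such as case sensitivity, and `split_torrent_files` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def split_torrent_files(filenames, excluded_patterns):
    """Return (downloaded, skipped) for the files inside one torrent."""
    downloaded, skipped = [], []
    for name in filenames:
        if any(fnmatchcase(name, pattern) for pattern in excluded_patterns):
            skipped.append(name)
        else:
            downloaded.append(name)
    return downloaded, skipped


files = ["movie.mkv", "movie.nor.srt"]
print(split_torrent_files(files, ["*.exe"]))           # *.srt not in the blacklist
print(split_torrent_files(files, ["*.exe", "*.srt"]))  # *.srt in the blacklist
```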
|
||||
|
||||
### Refreshing after a whitelist change
|
||||
|
||||
qBittorrent does not auto-refresh the list. After updating `whitelist`:
|
||||
|
||||
1. Wait for the next sync run (or dispatch the workflow manually).
|
||||
2. In qBittorrent, open the excluded file names setting and click
|
||||
**Reload**, or restart qBittorrent.
|
||||
3. New torrents added from this point on will use the updated list.
|
||||
Torrents already in the client are not retroactively changed.
|
||||
|
||||
## Cleanuparr
|
||||
|
||||
Cleanuparr supports two modes for its Malware Blocker and Blacklist Sync
|
||||
features. The repository provides files suitable for both.
|
||||
|
||||
### Blacklist mode
|
||||
|
||||
In blacklist mode, Cleanuparr deletes any file matching a pattern in the
|
||||
configured list.
|
||||
|
||||
Point it at the same URL as qBittorrent:
|
||||
|
||||
```
|
||||
https://git.hisp.no/arr/blocklists/raw/branch/main/blacklist
|
||||
```
|
||||
|
||||
Because the whitelist has already been subtracted, this file will not
cause Cleanuparr to delete anything you have marked as "keep" in the
whitelist. The two tools therefore behave consistently without any
per-tool customisation.
|
||||
|
||||
### Whitelist mode
|
||||
|
||||
In whitelist mode, Cleanuparr keeps only files matching a pattern in the
|
||||
configured list and deletes everything else.
|
||||
|
||||
Point it at:
|
||||
|
||||
```
|
||||
https://git.hisp.no/arr/blocklists/raw/branch/main/whitelist
|
||||
```
|
||||
|
||||
This is the more conservative choice: only the extensions explicitly
|
||||
listed (video containers and subtitles) are allowed. Anything else --
|
||||
including extensions that upstream has not yet flagged as malicious --
|
||||
is deleted.
|
||||
|
||||
### Which mode to use
|
||||
|
||||
| Use case | Mode | Why |
|---|---|---|
| You trust upstream Cleanuparr's coverage and want to keep everything except known-bad | Blacklist | Lets through unusual-but-legitimate file types (e.g. exotic subtitle formats) |
| You only want a strict set of video + subtitle files on disk | Whitelist | Much stricter; deletes anything not explicitly listed |
| You want behaviour consistent with qBittorrent | Blacklist | Same source file, same semantics |
|
||||
|
||||
Blacklist mode is the recommended default because it matches the
|
||||
qBittorrent side and avoids unexpected deletions of legitimate but
|
||||
non-listed files.
|
||||
|
||||
## Keeping both consumers in sync
|
||||
|
||||
Both consumers ultimately read the whitelist (directly in Cleanuparr
|
||||
whitelist mode, indirectly via subtraction in blacklist mode and in
|
||||
qBittorrent). This means maintenance is centralised:
|
||||
|
||||
1. Add a line to `whitelist`.
|
||||
2. Wait for the next sync run (or dispatch manually).
|
||||
3. Both consumers honour the change after their next refresh.
|
||||
|
||||
There is no per-tool configuration drift because there is no per-tool
|
||||
configuration to drift.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### A file I whitelisted is still being blocked / deleted
|
||||
|
||||
Check each layer in order:
|
||||
|
||||
1. **Sync ran successfully?** Open the Gitea Actions page for the
|
||||
repository and verify the most recent run is green and newer than
|
||||
your whitelist commit.
|
||||
2. **Blacklist was updated?** Read `blacklist` in Gitea and confirm your
|
||||
whitelisted entry is not present.
|
||||
3. **Consumer refreshed?** qBittorrent requires a manual reload or
|
||||
restart. Cleanuparr refreshes on its own interval -- check its logs
|
||||
to confirm it picked up the new file.
|
||||
4. **Exact string match?** Whitelist entries must match blacklist entries
|
||||
exactly. `*.srt` in whitelist does not strip `*sample.srt` from
|
||||
blacklist. See [Lists](Lists) for pattern semantics.
|
||||
|
||||
### A file I did not whitelist is passing through
|
||||
|
||||
Check whether the pattern is in the blacklist at all:
|
||||
|
||||
1. Open `blacklist` in Gitea and search for the extension.
|
||||
2. If it is not there, upstream does not block it either. You can add
|
||||
it to `blacklist` directly (manual local addition, preserved by the
|
||||
three-way merge) or file an upstream issue.
|
||||
|
||||
### Consumer returns 404
|
||||
|
||||
Verify the URL uses `raw/branch/main/`, not `src/branch/main/`:
|
||||
|
||||
```
|
||||
# Correct
|
||||
https://git.hisp.no/arr/blocklists/raw/branch/main/blacklist
|
||||
|
||||
# Wrong (serves HTML, not the file)
|
||||
https://git.hisp.no/arr/blocklists/src/branch/main/blacklist
|
||||
```
|
||||
|
||||
Also check the repository name and branch are correct
|
||||
(`arr/blocklists`, `main`).
|
||||
|
||||
### Cleanuparr deletes subtitle files
|
||||
|
||||
The most likely cause is that Cleanuparr is running in whitelist mode
against `blacklist`, which is the wrong combination: it keeps only files
matching the blacklist patterns and deletes everything else. Either switch
it to blacklist mode (keep the URL), or keep whitelist mode and point it at
`whitelist` instead.
|
||||
|
||||
# Lists
|
||||
|
||||
## The two-file model
|
||||
|
||||
The repository contains exactly two data files. Each has a single, clear
|
||||
role:
|
||||
|
||||
| File | Role | Source of truth | Edit it? |
|---|---|---|---|
| `blacklist` | Extensions blocked by downloaders and file cleaners | Upstream Cleanuparr, minus `whitelist` | Only for manual additions that upstream missed. Removals do not stick -- use `whitelist` instead |
| `whitelist` | Extensions that must never be blocked or deleted | Locally maintained, not synced from upstream | Yes. This is the main file you interact with |
|
||||
|
||||
`blacklist.prev` also exists in the repo but is not a data file -- it is
|
||||
the three-way merge baseline used by the sync script. Never edit it.
|
||||
|
||||
## `blacklist`
|
||||
|
||||
The blacklist is the output file consumed by qBittorrent and (optionally)
|
||||
Cleanuparr. It is regenerated on every sync as:
|
||||
|
||||
```
|
||||
(upstream_new | custom_local_additions) - whitelist
|
||||
```
|
||||
|
||||
Where `custom_local_additions` is detected by comparing the committed
|
||||
`blacklist` against the previous upstream snapshot. See
|
||||
[Sync](Sync) for the full algorithm.
|
||||
|
||||
### When to edit `blacklist` directly
|
||||
|
||||
In almost every case, you do not. The intended workflow is:
|
||||
|
||||
- To **remove** an entry (stop blocking it): add it to `whitelist`.
|
||||
- To **add** an entry that upstream should also have: file an upstream
|
||||
issue with Cleanuparr.
|
||||
- To **add** an entry that is specific to your setup and not worth
|
||||
upstreaming: edit `blacklist` directly. The three-way merge preserves
|
||||
manual additions across syncs.
|
||||
|
||||
### When editing `blacklist` directly does not work
|
||||
|
||||
Removing a line from `blacklist` does not work as a removal mechanism.
|
||||
The sync will re-add anything upstream has on the next run. If you want
|
||||
something gone, put it in `whitelist`.
|
||||
|
||||
## `whitelist`
|
||||
|
||||
The whitelist is the locally-maintained allow list. It is the single source
|
||||
of truth for "what must be kept." It is not synced from upstream -- any
|
||||
changes you make are permanent until you change them again.
|
||||
|
||||
### Format
|
||||
|
||||
One glob pattern per line, sorted, no blank lines, no comments:
|
||||
|
||||
```
|
||||
*.ass
|
||||
*.avi
|
||||
*.mkv
|
||||
*.mp4
|
||||
*.srt
|
||||
*.ssa
|
||||
*.sub
|
||||
*.webm
|
||||
```
|
||||
|
||||
The sort order is not enforced by the script but is the convention and
|
||||
makes diffs easier to read.
|
||||
|
||||
### Semantics
|
||||
|
||||
Each line is treated as an exact string and subtracted from the blacklist.
|
||||
See [Pattern matching](#pattern-matching) below for the details.
|
||||
|
||||
### Adding an entry
|
||||
|
||||
Edit `whitelist` in Gitea (or via a local clone and push), add the new
|
||||
line, commit. The next sync run (or manual dispatch) will strip it from
|
||||
the blacklist automatically.
|
||||
|
||||
You do not also need to remove it from the blacklist by hand -- the sync
|
||||
does that.
|
||||
|
||||
### Removing an entry
|
||||
|
||||
Delete the line from `whitelist` and commit. The next sync will re-add
|
||||
the entry to the blacklist if upstream still has it. If upstream no longer
|
||||
has the entry, the entry stays gone (which is probably what you want).
|
||||
|
||||
## Pattern matching
|
||||
|
||||
The whitelist-to-blacklist exclusion uses **exact-string set subtraction**,
|
||||
not glob matching. This is an intentional design choice that has two
|
||||
important consequences.
|
||||
|
||||
### Exact entries are stripped
|
||||
|
||||
`*.srt` in the whitelist removes exactly the string `*.srt` from the
|
||||
blacklist. If upstream has `*.srt` as a line, it gets removed. If upstream
|
||||
does not have `*.srt`, nothing happens.
|
||||
|
||||
### Partial matches are not affected
|
||||
|
||||
`*.srt` in the whitelist strips only the identical entry and leaves
every other entry in place:

| Blacklist entry | Stripped? | Why |
|---|---|---|
| `*.srt` | yes | Identical string |
| `*sample.srt` | no | Different string |
| `*.srt.bak` | no | Different string |
| `file.srt` | no | Different string |
|
||||
|
||||
This is what makes the whitelist safe to maintain. You can whitelist
|
||||
`*.srt` to keep bundled subtitle files without accidentally unblocking
|
||||
sample files or junk variants that happen to end in `.srt`.
|
||||
|
||||
### Why not glob matching
|
||||
|
||||
A glob-based exclusion would strip every blacklist entry matching `*.srt`
as a pattern, which would also strip `*sample.srt` and `file.srt`. That is
usually not what you want -- sample files are genuine junk that the
blacklist should still remove.
|
||||
|
||||
Exact-string subtraction is also trivially simple to reason about: if the
|
||||
line you want stripped is in the blacklist as the exact same string, put
|
||||
that same string in the whitelist. Done.
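
The difference between the two approaches is easy to see side by side. In this sketch the entries are illustrative, and the glob variant is a hypothetical alternative built on Python's `fnmatch`, not something the sync script does.

```python
from fnmatch import fnmatchcase

blacklist = {"*.srt", "*sample.srt", "file.srt", "*.exe"}
whitelist = {"*.srt"}

# What the sync script does: exact-string set subtraction.
exact = blacklist - whitelist

# Hypothetical alternative: treat each whitelist line as a glob and drop
# every blacklist entry it matches.
globbed = {
    entry
    for entry in blacklist
    if not any(fnmatchcase(entry, pattern) for pattern in whitelist)
}

print(sorted(exact))    # *sample.srt and file.srt are still blocked
print(sorted(globbed))  # only *.exe is left
```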
|
||||
|
||||
## Examples
|
||||
|
||||
### Keeping Norwegian subtitle files
|
||||
|
||||
Scenario: torrents include `.srt` files as bundled Norwegian subtitles.
|
||||
You want qBittorrent to download them, not strip them.
|
||||
|
||||
```
|
||||
# whitelist entry
|
||||
*.srt
|
||||
```
|
||||
|
||||
After the next sync, `*.srt` is gone from `blacklist`. qBittorrent now
|
||||
accepts `.srt` files from within torrents. `*sample.srt` remains blocked.
|
||||
|
||||
### Supporting AV1 in `.webm` containers
|
||||
|
||||
Scenario: you want qBittorrent to accept `.webm` AV1 releases, which are
|
||||
currently blocked because the upstream blacklist treats `*.webm` as junk.
|
||||
|
||||
```
|
||||
# whitelist entry
|
||||
*.webm
|
||||
```
|
||||
|
||||
After the next sync, `*.webm` is gone from `blacklist`. `.webm` torrents
|
||||
download normally. `*sample.webm` remains blocked.
|
||||
|
||||
### Adding a site-specific junk extension
|
||||
|
||||
Scenario: a private tracker keeps injecting `*.nfo.gz` spam files that
|
||||
upstream does not block.
|
||||
|
||||
```
|
||||
# Edit blacklist directly, add the line:
|
||||
*.nfo.gz
|
||||
```
|
||||
|
||||
Commit and push. The next sync runs, the three-way merge sees
|
||||
`*.nfo.gz` in `local - upstream_prev`, classifies it as a manual addition,
|
||||
and preserves it through the merge. Subsequent syncs continue to preserve
|
||||
it even as upstream evolves.
|
||||
|
||||
If upstream ever adds `*.nfo.gz` itself, the entry moves from "custom"
|
||||
to "upstream" on the next sync -- still present, still blocked, just
|
||||
sourced differently.
|
||||
|
||||
## What lives in each file right now
|
||||
|
||||
The whitelist ships with the extensions required for a normal media
|
||||
stack with subtitles and AV1/webm releases:
|
||||
|
||||
```
|
||||
*.ass - Advanced SubStation Alpha subtitles
|
||||
*.avi - Audio Video Interleave container
|
||||
*.mkv - Matroska container
|
||||
*.mp4 - MPEG-4 container
|
||||
*.srt - SubRip subtitles
|
||||
*.ssa - SubStation Alpha subtitles
|
||||
*.sub - MicroDVD / VobSub subtitles
|
||||
*.webm - WebM container (AV1, VP9)
|
||||
```
|
||||
|
||||
The blacklist contains whatever upstream Cleanuparr ships, minus everything
|
||||
in the whitelist above. The actual contents change as upstream evolves --
|
||||
check the file in Gitea for the current state.
|
||||
|
||||
## Consumer consequences
|
||||
|
||||
Changes to either file affect what consumers see:
|
||||
|
||||
| Change | Effect on qBittorrent | Effect on Cleanuparr (blacklist mode) | Effect on Cleanuparr (whitelist mode) |
|---|---|---|---|
| Add to `whitelist` | Stops blocking this extension | Stops deleting this extension | Starts allowing this extension |
| Remove from `whitelist` | Resumes blocking (if upstream has it) | Resumes deleting (if upstream has it) | Stops allowing this extension |
| Add to `blacklist` directly | Starts blocking this extension | Starts deleting this extension | No effect |
| Remove from `blacklist` directly | No lasting effect (sync re-adds upstream entries) | No lasting effect (sync re-adds upstream entries) | No effect |
|
||||
|
||||
See [Consumers](Consumers) for configuration details.
|
||||
|
||||
# Sync
|
||||
|
||||
## Overview
|
||||
|
||||
The sync process fetches the upstream Cleanuparr blacklist, preserves any
|
||||
manual local additions, subtracts the locally-maintained whitelist, and
|
||||
writes the result back to `blacklist`. It runs on a schedule (every 7 days)
|
||||
and on manual dispatch. All logic lives in `scripts/merge_blocklists.py`
|
||||
(about 45 lines of pure Python standard library, no third-party deps).
|
||||
|
||||
## Inputs
|
||||
|
||||
The script reads four sources on every run:
|
||||
|
||||
| Source | Path | Role |
|---|---|---|
| Upstream | `https://raw.githubusercontent.com/Cleanuparr/Cleanuparr/main/blacklist` | Current upstream state, fetched over HTTPS |
| Upstream snapshot | `blacklist.prev` | What upstream looked like on the previous sync (baseline) |
| Committed blacklist | `blacklist` | Current committed state, may contain manual local additions |
| Whitelist | `whitelist` | Locally-maintained entries to strip from the merged result |
|
||||
|
||||
All four are parsed the same way: one entry per non-empty line, stripped of
|
||||
leading/trailing whitespace, loaded into a Python `set`.
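
The parsing rule is small enough to show in isolation. `parse_entries` is an illustrative stand-alone version that works on a string; the script applies the same rule to file and HTTP content.

```python
def parse_entries(text):
    """One entry per non-empty line, surrounding whitespace removed."""
    return {line.strip() for line in text.splitlines() if line.strip()}


sample = "*.mkv\n\n  *.srt  \n*.mkv\n"
print(sorted(parse_entries(sample)))
```

Because the result is a set, duplicate lines collapse and the original order is not kept. A line starting with `#` is not treated as a comment; it becomes an entry like any other.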
|
||||
|
||||
## Three-way merge
|
||||
|
||||
The script performs a classic three-way merge, git-style, using set
|
||||
operations:
|
||||
|
||||
```
|
||||
custom = local - upstream_prev
|
||||
merged = upstream_new | custom
|
||||
result = merged - whitelist
|
||||
```
|
||||
|
||||
Each line does one specific job:
|
||||
|
||||
### `custom = local - upstream_prev`
|
||||
|
||||
Compute what was added locally. Anything in the committed `blacklist` that
|
||||
was not in the previous upstream snapshot must be a manual local addition,
|
||||
because the sync script is the only other thing that writes to `blacklist`
|
||||
and it always produces a subset of `upstream_new | custom`. Tracking this
|
||||
set lets the next sync re-apply those additions on top of the new upstream.
|
||||
|
||||
### `merged = upstream_new | custom`
|
||||
|
||||
Union the fresh upstream with the preserved local additions. Upstream
|
||||
additions flow in (they appear in `upstream_new`), upstream removals flow
|
||||
out (they were in `upstream_prev` but are not in `upstream_new`, and are
|
||||
also not in `custom`), and manual local additions survive.
|
||||
|
||||
### `result = merged - whitelist`
|
||||
|
||||
Strip every entry that appears in the locally-maintained whitelist. This
|
||||
is the step that enables local removals: an extension placed in `whitelist`
|
||||
is always removed from the final `blacklist`, no matter how many times
|
||||
upstream re-adds it.
|
||||
|
||||
After the merge the script writes `result` to `blacklist` and overwrites
|
||||
`blacklist.prev` with `upstream_new` so the next run has a fresh baseline.
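
A worked example makes the three steps concrete. The entries below are made up for illustration and are not the real upstream contents; `merge` simply wraps the three set operations described in this section.

```python
def merge(upstream_new, upstream_prev, local, whitelist):
    """Three-way merge of blocklist entries using plain set operations."""
    custom = local - upstream_prev   # manual local additions
    merged = upstream_new | custom   # fresh upstream plus those additions
    return merged - whitelist        # whitelist always wins


# Illustrative data only.
upstream_prev = {"*.exe", "*.lnk", "*.srt"}
upstream_new = {"*.exe", "*.scr", "*.srt"}   # upstream dropped *.lnk, added *.scr
local = {"*.exe", "*.lnk", "*.nfo.gz"}       # *.nfo.gz was added by hand
whitelist = {"*.srt"}

result = merge(upstream_new, upstream_prev, local, whitelist)
print(sorted(result))  # ['*.exe', '*.nfo.gz', '*.scr']
```

`*.lnk` flows out because upstream removed it, `*.scr` flows in, `*.nfo.gz` survives as a manual addition, and `*.srt` is stripped by the whitelist.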
|
||||
|
||||
## Why a three-way merge
|
||||
|
||||
A simpler design would be `result = upstream_new - whitelist`, with no
|
||||
`.prev` file and no custom tracking. That works for the common case but
|
||||
drops an escape hatch: if you spot something upstream missed (a new
|
||||
malware extension, a tracker-specific junk file) and add it directly to
|
||||
`blacklist`, the next sync would silently drop it.
|
||||
|
||||
The three-way merge preserves those manual additions without requiring
|
||||
them to live in a separate "additions" file. If you never add anything
|
||||
directly, the `custom` set is empty on every run and the merge reduces to
|
||||
`upstream_new - whitelist`. The overhead is one extra file (`blacklist.prev`)
|
||||
and two set operations.
|
||||
|
||||
## Whitelist exclusion
|
||||
|
||||
The whitelist is subtracted with exact-string set subtraction, not pattern
|
||||
matching. This has two important consequences:
|
||||
|
||||
### Exact entries are stripped
|
||||
|
||||
`*.srt` in `whitelist` strips exactly `*.srt` from the blacklist. Same for
|
||||
`*.webm`, `*.mkv`, etc.
|
||||
|
||||
### Sample patterns are preserved
|
||||
|
||||
The upstream blacklist contains entries like `*sample.srt`, `*sample.webm`,
and `*sample.mkv` that block files whose names end in `sample` plus the
extension. These are separate string entries from `*.srt` or `*.webm`, so
whitelisting the plain extension does not remove the sample-file variant.
Sample files continue to be blocked.
|
||||
|
||||
This is almost always the behaviour you want: subtitle files shipped inside
|
||||
a release are kept, but standalone "sample.srt" clutter is still filtered.
|
||||
|
||||
## The `.prev` file
|
||||
|
||||
`blacklist.prev` is a plain text snapshot of whatever `upstream_new` was on
|
||||
the previous successful run. It has no special format, no metadata, and is
|
||||
never edited manually. The sync script rewrites it at the end of every run.
|
||||
|
||||
It exists purely as the baseline for the `local - upstream_prev` step in
|
||||
the three-way merge. Without it, the script could not distinguish "this
|
||||
entry was in local because upstream had it" from "this entry was in local
|
||||
because someone added it manually."
|
||||
|
||||
If `blacklist.prev` is missing or empty (first run, or manually deleted),
the script treats the current `upstream_new` as the baseline. Every entry
in `blacklist` that is not in the current upstream is then classified as a
manual addition and preserved. That includes entries upstream has removed
since the last sync, which would otherwise have been dropped -- delete
those from `blacklist` by hand after the run if you do not want them.
|
||||
|
||||
## Edge cases
|
||||
|
||||
### First run
|
||||
|
||||
`blacklist.prev` does not exist, `blacklist` may or may not exist.
|
||||
`upstream_prev = upstream_new`, so `custom = local - upstream_new` (anything
|
||||
in `local` that is not upstream). After the run, `.prev` exists and
|
||||
subsequent runs use the normal path.
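
The fallback can be traced with a small sketch. `custom_entries` is a hypothetical helper that mirrors the baseline rule of the script shown under Full script; the data is illustrative.

```python
def custom_entries(upstream_new, upstream_prev, local):
    """Compute `custom`, falling back to upstream_new when no baseline exists."""
    baseline = upstream_prev or set(upstream_new)
    return local - baseline


upstream_new = {"*.exe", "*.scr"}
local = {"*.exe", "*.lnk", "*.nfo.gz"}   # *.lnk came from an older upstream

# No baseline: everything in local that upstream lacks counts as custom.
print(sorted(custom_entries(upstream_new, set(), local)))

# With a baseline that still lists *.lnk, only *.nfo.gz counts as custom.
print(sorted(custom_entries(upstream_new, {"*.exe", "*.lnk"}, local)))
```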
|
||||
|
||||
### Empty or missing whitelist
|
||||
|
||||
If `whitelist` is missing or empty, `whitelist = set()` and the subtraction
|
||||
is a no-op. The merge degenerates to a plain upstream sync with local
|
||||
additions preserved.
|
||||
|
||||
### Empty or missing blacklist
|
||||
|
||||
If `blacklist` is missing, `local = set()`, `custom = set()`, and
|
||||
`result = upstream_new - whitelist`. Equivalent to a fresh install.
|
||||
|
||||
### Upstream removes an entry that is also in the whitelist
|
||||
|
||||
Harmless. `upstream_new` does not contain it, so `merged` does not contain
|
||||
it, and the whitelist subtraction removes nothing (the entry was already
|
||||
absent). The whitelist entry stays as a harmless no-op for future syncs.
|
||||
|
||||
### An entry appears in both whitelist and blacklist custom additions
|
||||
|
||||
You manually added `*.foo` to `blacklist` and also added `*.foo` to
|
||||
`whitelist`. The whitelist wins: `*.foo` is in `custom`, survives the
|
||||
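union, then gets stripped by the final subtraction. The committed
`blacklist` will not contain `*.foo`. Because `custom` is recomputed from
the committed `blacklist` on every run, the entry is not remembered:
removing `*.foo` from `whitelist` later does not bring it back, and you
have to re-add it to `blacklist` by hand.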
|
||||
|
||||
## Reporting
|
||||
|
||||
Each sync run logs four lines to the workflow output:
|
||||
|
||||
```
|
||||
[blacklist] Upstream added: [...]
|
||||
[blacklist] Upstream removed: [...]
|
||||
[blacklist] Custom preserved: [...]
|
||||
[blacklist] Whitelist stripped: [...]
|
||||
```
|
||||
|
||||
These are sorted lists showing exactly what changed. Check the Actions run
|
||||
log after any sync to see what happened, especially if a consumer reports
|
||||
unexpected behaviour.
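
The logging code is not reproduced on this page, so the following is only a plausible sketch of how the four lines could be derived from the merge inputs. The set definitions chosen for each label are assumptions, as is the `report` helper; check `scripts/merge_blocklists.py` for the real implementation.

```python
def report(upstream_new, upstream_prev, local, whitelist):
    """Build the four summary lines (assumed definitions, see the text above)."""
    custom = local - upstream_prev
    merged = upstream_new | custom
    sections = [
        ("Upstream added", upstream_new - upstream_prev),
        ("Upstream removed", upstream_prev - upstream_new),
        ("Custom preserved", custom),
        ("Whitelist stripped", merged & whitelist),
    ]
    return [f"[blacklist] {label}: {sorted(entries)}" for label, entries in sections]


for line in report(
    upstream_new={"*.exe", "*.scr", "*.srt"},
    upstream_prev={"*.exe", "*.lnk", "*.srt"},
    local={"*.exe", "*.lnk", "*.nfo.gz"},
    whitelist={"*.srt"},
):
    print(line)
```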
|
||||
|
||||
## Full script
|
||||
|
||||
```python
import urllib.request

UPSTREAM_URL = "https://raw.githubusercontent.com/Cleanuparr/Cleanuparr/main/blacklist"
BLACKLIST = "blacklist"
BLACKLIST_PREV = "blacklist.prev"
WHITELIST = "whitelist"


def read_lines(path):
    try:
        with open(path) as f:
            return set(line.strip() for line in f if line.strip())
    except FileNotFoundError:
        return set()


def main():
    with urllib.request.urlopen(UPSTREAM_URL) as r:
        upstream_new = set(
            line.strip() for line in r.read().decode().splitlines() if line.strip()
        )

    upstream_prev = read_lines(BLACKLIST_PREV)
    if not upstream_prev:
        upstream_prev = upstream_new.copy()

    local = read_lines(BLACKLIST)
    whitelist = read_lines(WHITELIST)

    custom = local - upstream_prev
    merged = upstream_new | custom
    result = merged - whitelist

    with open(BLACKLIST, "w") as f:
        f.write("\n".join(sorted(result)) + "\n")

    with open(BLACKLIST_PREV, "w") as f:
        f.write("\n".join(sorted(upstream_new)) + "\n")
```
|
||||
|
||||
Logging and the `__main__` guard are omitted above for clarity. See
|
||||
`scripts/merge_blocklists.py` in the repository for the full source.
|
||||