almost ready

bob
2026-06-01 10:52:06 -05:00
parent 8b0eb0db78
commit 763305ca89
94 changed files with 8766 additions and 2674 deletions


@@ -18,13 +18,20 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
WORKDIR /app
COPY worker/requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY worker/worker.py worker/entrypoint.sh ./
# blworker/ is the shared package both market scripts import; ship it + the two thin
# market scripts + the entrypoint.
COPY worker/blworker ./blworker
COPY worker/csmoney_worker.py worker/skinland_worker.py worker/entrypoint.sh ./
RUN chmod +x entrypoint.sh
# Which worker this image runs (overridden per service in docker-compose). The cs.money
# worker is the default; the skin.land service sets WORKER_SCRIPT=skinland_worker.py.
ENV BROWSER_PATH=/usr/bin/chromium \
CHROME_NO_SANDBOX=1 \
DISPLAY=:99 \
SOLVE_SECONDS=45 \
WORKER_SCRIPT=csmoney_worker.py \
LOG_JSON=1 \
PYTHONUNBUFFERED=1


@@ -14,47 +14,27 @@ webdriver` and chromedriver `cdc_` artifacts that Cloudflare keys on. `nodriver`
drives a normal Chromium directly over CDP (no chromedriver) and patches those
tells, so it passes where Selenium loops.
## Step 1: prove it (current)
`poc.py` proves nodriver can clear cs.money's Cloudflare and fetch the listings API
before we build the full pull-based fleet.
## Local setup
```powershell
cd worker
py -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
python poc.py
```
A Chromium window opens on the market. Solve the Cloudflare check if shown; the
script waits, then pages `sell-orders` deeply (PAGES), reporting how far the warm
session survives before any re-challenge and confirming full float precision.
Output lands in `worker/captures/`.
**Targeted skin+wear search.** cs.money search is free-text on the page
(`?search=cyber+security+ft`). Set `SEARCH` and the PoC navigates there, **captures
the actual filtered `sell-orders` API request the page fires** (so we learn the real
filter params instead of guessing), prints it, then pages that filtered API:
```powershell
$env:SEARCH="cyber security ft"; python poc.py # FT M4A4 Cyber Security only
```
The `>>> DISCOVERED sell-orders API call` line shows how the search maps to API
params — that's how the C2 will build targeted jobs.
Run on your own IP first (no proxy) — that's the clean A/B vs. the Selenium run.
If auto-detect can't find a browser, set `BROWSER_PATH` to Chrome or Edge
(`C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe`).
## Step 2: the pull fleet
## The pull fleet
`worker.py` holds one warm nodriver session and loops: poll the .NET C2 for a job
(a skin+wear search), scrape that search's sell-orders via in-page fetch, and post
`csmoney_worker.py` holds one warm nodriver session and loops: poll the .NET C2 for a
job (a skin+wear search), scrape that search's sell-orders via in-page fetch, and post
the items back. The C2 (`BlueLaminate.C2`) picks the stalest skin+wear from the
catalogue, and on result persists to `cs_money_listings` + `price_history`
(`Source = "csmoney"`), stamping `SkinCondition.ListingsSweptAt`.
(`Source = "csmoney"`), stamping that band's per-site checkpoint (the `csmoney`
row in `skin_condition_sweeps`). The checkpoint is per-site, so a band CSFloat
already swept is still due for a cs.money sweep.
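
To make the per-site checkpoint concrete, here is a minimal Python sketch of "stalest first, per site" selection. The real selection lives in the .NET C2 and is not shown here; the `pick_stalest` helper, the `(band, site)` key layout, and the sample data are assumptions for illustration only.

```python
from datetime import datetime, timezone


def pick_stalest(bands, sweeps, site):
    """Return the band whose checkpoint for `site` is oldest.

    `sweeps` is a hypothetical in-memory stand-in for skin_condition_sweeps:
    (band, site) -> last sweep time. A missing key means that site has never
    swept that band, so it sorts first.
    """
    never = datetime.min.replace(tzinfo=timezone.utc)
    return min(bands, key=lambda band: sweeps.get((band, site), never))


bands = ["AK-47 | Redline (FT)", "M4A4 | Cyber Security (FT)"]
sweeps = {
    ("AK-47 | Redline (FT)", "csfloat"): datetime(2026, 6, 1, tzinfo=timezone.utc),
    ("AK-47 | Redline (FT)", "csmoney"): datetime(2026, 5, 30, tzinfo=timezone.utc),
    ("M4A4 | Cyber Security (FT)", "csfloat"): datetime(2026, 6, 1, tzinfo=timezone.utc),
}

# CSFloat has swept both bands, but cs.money has never swept the M4A4,
# so that band is still due for a cs.money sweep.
next_band = pick_stalest(bands, sweeps, "csmoney")
```

Because the lookup key includes the site, a sweep recorded for one market never hides a band from another market's queue.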
Run the C2 (needs Postgres migrated), then the worker:
@@ -65,8 +45,64 @@ dotnet run --project BlueLaminate\BlueLaminate.C2 # serves http://local
# terminal 2 — the worker
cd worker; .venv\Scripts\Activate.ps1
$env:WORKER_TOKEN="dev-worker-token" # must match the C2's WorkerToken
python worker.py
python csmoney_worker.py
```
The worker warms the session (you clear Cloudflare once), then runs continuously.
Scale out by starting more workers (each with its own `PROXY`).
## Layout
Both market scripts are thin: each subclasses `blworker.Worker` and fills in only its
own scrape + cookie-consent steps. Everything shared lives in the `blworker/` package:
| file | responsibility |
| --- | --- |
| `blworker/config.py` | `Settings` — every env knob, parsed once |
| `blworker/log.py` | stdout logging, human or `LOG_JSON=1` (for Loki) |
| `blworker/proxy.py` | IPRoyal forwarder + session/password helpers |
| `blworker/c2.py` | `C2Client` — claim a job, post a result |
| `blworker/runtime.py` | `Worker` base: proxy/browser bring-up, the poll→scrape→post loop, Cloudflare IP rotation, graceful shutdown |
| `csmoney_worker.py` / `skinland_worker.py` | the per-market scrape strategies |
To add a market: subclass `Worker`, set `name`/`jobs_path`/`default_market_url`, implement
`scrape_job` + `describe_job` (+ `dismiss_consent` if it has a banner), and call
`run(YourWorker)`.
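
As a rough illustration of those steps, the sketch below subclasses a stand-in base and fills in the two required hooks. `ExampleMarketWorker`, its URL, its `/examplemarket/jobs` route, and the `slug` job field are hypothetical. The `Worker` and `ScrapeResult` classes here only mimic the shape of the real `blworker` ones (no browser, proxy, or C2) so the block runs on its own.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ScrapeResult:  # stand-in mirroring the fields of blworker.ScrapeResult
    items: list
    pages: int
    reason: str
    wire_bytes: int = 0


class Worker(ABC):  # stand-in for blworker.Worker: only the hooks, no runtime
    name = "worker"
    jobs_path = "/jobs"
    default_market_url = ""

    @abstractmethod
    async def scrape_job(self, page, job) -> ScrapeResult: ...

    @abstractmethod
    def describe_job(self, job) -> str: ...

    async def dismiss_consent(self, page):
        return None  # default: this market has no banner


class ExampleMarketWorker(Worker):  # hypothetical third market
    name = "examplemarket"
    jobs_path = "/examplemarket/jobs"
    default_market_url = "https://example.invalid/market/"

    def describe_job(self, job) -> str:
        return f"slug {job['slug']!r}"

    async def scrape_job(self, page, job) -> ScrapeResult:
        # A real worker would fetch through the warm page; this returns canned data.
        items = [{"id": 1, "slug": job["slug"]}]
        return ScrapeResult(items, pages=1, reason="completed")


worker = ExampleMarketWorker()
result = asyncio.run(
    worker.scrape_job(page=None, job={"slug": "ak-47-redline-field-tested"}))
```

With the real package, the stand-ins would be replaced by `from blworker import ScrapeResult, Worker, run`, and the last two statements by `run(ExampleMarketWorker)`.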
## skin.land worker
`skinland_worker.py` is the same pull model for **skin.land** (also Cloudflare-walled). It
shares all the proxy/Cloudflare/C2 plumbing with the cs.money worker via `blworker`; only
the scrape differs. The C2 hands out jobs from its **`/skinland/jobs`** group (the
`skinland` rows in `skin_condition_sweeps`, so a band cs.money/CSFloat already swept is
still due here) and on result persists to `skin_land_listings` + `price_history`
(`Source = "skinland"`).
How it scrapes (learned during discovery):
- A job's target is the market **page URL**, e.g.
`https://skin.land/market/csgo/ak-47-redline-field-tested/`. The slug is just
`{weapon}-{skin}-{wear}` kebab-cased — the C2 builds it from the catalogue, no lookup.
- skin.land is a Nuxt SSR app. The page embeds an internal numeric `skin_id`; the worker
resolves it once from the `__NUXT__` payload (the skin object whose `url` == the slug),
caches it per slug, then pages the clean JSON API
`GET https://app.skin.land/api/v2/obtained-skins?skin_id={id}&page={n}` (a Laravel
paginator `{data:[…offers], meta:{current_page,last_page,…}}`), walking to `last_page`.
- Each offer carries a full-precision `item_float`, `final_withdrawal_price`, and the Steam
`item_link`. skin.land exposes **no paint seed**, so listings aren't fingerprinted to a
`SkinInstance` (no cross-market roll-up / dupe detection here). StatTrak and Souvenir are
separate pages (`stattrak-`/`souvenir-` slugs); v1 sweeps the base page per skin+wear.
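
The slug building and the paginator walk described above can be sketched as follows. This is not the worker's actual code: `build_slug`, `walk_pages`, and the canned `FAKE` responses are hypothetical, and the exact kebab-casing rule (lowercase, runs of non-alphanumerics collapsed to a hyphen) is an assumption.

```python
import re


def build_slug(weapon: str, skin: str, wear: str) -> str:
    """Kebab-case {weapon}-{skin}-{wear}; the normalisation rule is an assumption."""
    joined = f"{weapon} {skin} {wear}".lower()
    return re.sub(r"[^a-z0-9]+", "-", joined).strip("-")


def walk_pages(fetch_page) -> list:
    """Collect offers from a Laravel-style paginator by walking to meta.last_page.

    `fetch_page(n)` is a caller-supplied function returning the decoded JSON
    payload for page n.
    """
    offers, page = [], 1
    while True:
        payload = fetch_page(page)
        offers.extend(payload.get("data", []))
        last = payload.get("meta", {}).get("last_page", page)
        if page >= last:
            return offers
        page += 1


# Canned two-page response standing in for the obtained-skins API.
FAKE = {
    1: {"data": [{"item_float": 0.151}, {"item_float": 0.162}],
        "meta": {"current_page": 1, "last_page": 2}},
    2: {"data": [{"item_float": 0.379}],
        "meta": {"current_page": 2, "last_page": 2}},
}

slug = build_slug("AK-47", "Redline", "Field-Tested")
offers = walk_pages(lambda n: FAKE[n])
```

The resulting slug slots into the page URL as `https://skin.land/market/csgo/{slug}/`; in the real worker the fetch goes through the warm browser session rather than a plain function.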
Run it alongside (or instead of) the cs.money worker — it points at the same C2:
```powershell
cd worker; .venv\Scripts\Activate.ps1
$env:WORKER_TOKEN="dev-worker-token"
python skinland_worker.py
```
Under Docker it's the `skinland-worker` service (same image, `WORKER_SCRIPT=skinland_worker.py`):
```powershell
docker compose up --build --scale skinland-worker=5
```


@@ -0,0 +1,20 @@
"""Shared scaffolding for the BlueLaminate market scrape workers.

A market worker (cs.money, skin.land, …) subclasses `Worker`, fills in its scrape +
consent steps, and calls `run(MyWorker)`. Everything else — config, logging, the IPRoyal
proxy/forwarder, the C2 client, the poll/scrape/post loop, IP rotation, graceful
shutdown — lives here so it's written once.
"""
from .config import Settings
from .runtime import ScrapeResult, Worker, click, looks_like_challenge, page_fetch, run

__all__ = [
    "Settings",
    "ScrapeResult",
    "Worker",
    "click",
    "looks_like_challenge",
    "page_fetch",
    "run",
]

57
worker/blworker/c2.py Normal file

@@ -0,0 +1,57 @@
"""HTTP client for the .NET C2's job endpoints.

Stdlib urllib so the blocking calls run off the asyncio loop via to_thread (the event
loop belongs to the browser). Each worker points at one job route group — "/jobs" for
cs.money, "/skinland/jobs" for skin.land — set once at construction.
"""

import asyncio
import json
import logging
import urllib.error
import urllib.request

log = logging.getLogger("c2")


class C2Client:
    def __init__(self, base_url: str, token: str, jobs_path: str):
        self._base = base_url.rstrip("/")
        self._token = token
        self._jobs = jobs_path.strip("/")

    def _get_job_sync(self):
        req = urllib.request.Request(
            f"{self._base}/{self._jobs}/next", headers={"X-Worker-Token": self._token})
        try:
            with urllib.request.urlopen(req, timeout=15) as r:
                if r.status == 204:
                    return None
                return json.loads(r.read() or b"null")
        except urllib.error.HTTPError as e:
            log.warning("/%s/next -> HTTP %s", self._jobs, e.code)
            return None
        except urllib.error.URLError as e:
            log.warning("C2 unreachable: %s", e)
            return None

    def _post_result_sync(self, job_id: str, payload: dict):
        data = json.dumps(payload).encode()
        req = urllib.request.Request(
            f"{self._base}/{self._jobs}/{job_id}/result", data=data, method="POST",
            headers={"X-Worker-Token": self._token, "Content-Type": "application/json"})
        try:
            with urllib.request.urlopen(req, timeout=60) as r:
                return json.loads(r.read() or b"null")
        except urllib.error.HTTPError as e:
            log.warning("result -> HTTP %s: %r", e.code, e.read()[:200])
            return None
        except urllib.error.URLError as e:
            log.warning("C2 unreachable posting result: %s", e)
            return None

    async def get_job(self):
        return await asyncio.to_thread(self._get_job_sync)

    async def post_result(self, job_id, payload):
        return await asyncio.to_thread(self._post_result_sync, job_id, payload)

81
worker/blworker/config.py Normal file

@@ -0,0 +1,81 @@
"""Worker configuration, parsed once from the environment.

All env knobs the workers honor live here so there's a single source of truth (the
two market workers used to each re-parse the same ~15 vars). Frozen dataclass — read
it, don't mutate it.
"""

import os
from dataclasses import dataclass


def _int(name: str, default: int) -> int:
    return int(os.environ.get(name, str(default)))


def _float(name: str, default: float) -> float:
    return float(os.environ.get(name, str(default)))


def _flag(name: str) -> bool:
    return os.environ.get(name) == "1"


@dataclass(frozen=True)
class Settings:
    # C2
    c2_url: str
    token: str
    # Session / pacing
    market_url: str  # "" => use the worker's own default page
    solve_seconds: int
    delay: float
    jitter: float
    idle_seconds: int
    # Browser
    browser_path: str | None
    load_images: bool
    chrome_no_sandbox: bool
    # Proxy (auth-free fallback)
    proxy: str | None
    # IPRoyal residential gateway
    iproyal_host: str
    iproyal_port: int
    iproyal_username: str | None
    iproyal_password: str | None
    iproyal_country: str
    iproyal_lifetime_min: int
    # Logging
    log_level: str
    log_json: bool

    @property
    def use_iproyal(self) -> bool:
        """IPRoyal takes priority over a plain PROXY when its creds are set."""
        return bool(self.iproyal_username and self.iproyal_password)

    @classmethod
    def from_env(cls) -> "Settings":
        return cls(
            c2_url=os.environ.get("C2_URL", "http://localhost:5080").rstrip("/"),
            token=os.environ.get("WORKER_TOKEN", "dev-worker-token"),
            market_url=os.environ.get("MARKET_URL", ""),
            solve_seconds=_int("SOLVE_SECONDS", 30),
            delay=_float("DELAY", 2.0),
            jitter=_float("JITTER", 1.5),
            idle_seconds=_int("IDLE_SECONDS", 10),
            browser_path=os.environ.get("BROWSER_PATH") or None,
            # Residential proxy is metered per GB; Cloudflare gates on JS, not images, and
            # the market APIs are pure JSON — so block images unless explicitly debugging.
            load_images=_flag("LOAD_IMAGES"),
            chrome_no_sandbox=_flag("CHROME_NO_SANDBOX"),
            proxy=os.environ.get("PROXY") or None,
            iproyal_host=os.environ.get("IPROYAL_HOST", "geo.iproyal.com"),
            iproyal_port=_int("IPROYAL_PORT", 12321),
            iproyal_username=os.environ.get("IPROYAL_USERNAME") or None,
            iproyal_password=os.environ.get("IPROYAL_PASSWORD") or None,
            iproyal_country=os.environ.get("IPROYAL_COUNTRY", "us").strip().lower(),
            iproyal_lifetime_min=_int("IPROYAL_LIFETIME_MIN", 60),
            log_level=os.environ.get("LOG_LEVEL", "INFO").upper(),
            log_json=_flag("LOG_JSON"),
        )

47
worker/blworker/log.py Normal file

@@ -0,0 +1,47 @@
"""Stdlib logging setup — one stream handler on stdout, human or JSON.

Workers used to print() everything; that gives no levels, no timestamps, and nothing
Loki can parse. Default is a compact human format for local runs; set LOG_JSON=1 in the
container so Grafana Alloy -> Loki gets structured fields (ts, level, logger, msg) plus
any `extra=` keys a call site attaches.
"""

import json
import logging
import sys

# logging.LogRecord built-ins we don't want to echo into a JSON line as "extra" fields.
_RESERVED = set(
    logging.makeLogRecord({}).__dict__
) | {"message", "asctime", "taskName"}


class _JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        for key, value in record.__dict__.items():
            if key not in _RESERVED and not key.startswith("_"):
                payload[key] = value
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def configure(level: str = "INFO", json_logs: bool = False) -> None:
    """Install a single stdout handler on the root logger (idempotent)."""
    handler = logging.StreamHandler(sys.stdout)
    if json_logs:
        handler.setFormatter(_JsonFormatter())
    else:
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)-5s %(name)s | %(message)s", "%H:%M:%S")
        )
    root = logging.getLogger()
    root.handlers.clear()
    root.addHandler(handler)
    root.setLevel(level)

154
worker/blworker/proxy.py Normal file

@@ -0,0 +1,154 @@
"""IPRoyal residential proxy plumbing.

The in-process forwarder + the password/session helpers — identical across every market
worker, so they live here. HTTPS market traffic flows through the CONNECT tunnel, so the
forwarder only ever relays ciphertext. Ported from the .NET LocalForwardingProxy /
IpRoyalProxyProvider.
"""

import asyncio
import base64
import logging
import uuid

log = logging.getLogger("proxy")


def new_session_id() -> str:
    """Short, opaque, URL-safe token. IPRoyal pins one residential exit IP per distinct
    session value, so a fresh id == a fresh IP."""
    return uuid.uuid4().hex[:10]


def iproyal_password(password: str, country: str, lifetime_min: int, session_id: str) -> str:
    """Bake the targeting/session knobs onto the account password, IPRoyal-style:
    "<pass>_country-us_session-<id>_lifetime-60m". Country is optional."""
    pw = password
    if country:
        pw += f"_country-{country}"
    pw += f"_session-{session_id}_lifetime-{lifetime_min}m"
    return pw


class LocalForwardingProxy:
    """In-process HTTP proxy on 127.0.0.1 that chains every connection to the IPRoyal
    gateway, injecting the Proxy-Authorization header itself. Chromium ignores creds in
    --proxy-server and the in-browser ways to answer the gateway's 407 (a CDP auth
    handler, or a disabled MV2 extension) are Cloudflare tells — so we terminate the
    browser->proxy hop locally and add auth here, leaving Chrome to talk to an auth-free
    endpoint at zero CDP. HTTPS (all market traffic) flows through the CONNECT tunnel, so
    this proxy only relays ciphertext and never sees plaintext. The active session token
    can be swapped live (set_password) to move to a fresh exit IP without restarting the
    browser. (New tunnels pick up the new IP; any still-open keep-alive tunnel stays on
    the old one until it closes.)"""

    def __init__(self, host: str, port: int, username: str, password: str):
        self._host = host
        self._port = port
        self._username = username
        self._password = password
        self._server: asyncio.AbstractServer | None = None
        self.endpoint = ""

    def set_password(self, password: str) -> None:
        self._password = password

    def _auth_header(self) -> str:
        token = base64.b64encode(f"{self._username}:{self._password}".encode()).decode()
        return f"Proxy-Authorization: Basic {token}\r\n"

    async def start(self) -> "LocalForwardingProxy":
        self._server = await asyncio.start_server(self._handle, "127.0.0.1", 0)
        port = self._server.sockets[0].getsockname()[1]
        self.endpoint = f"127.0.0.1:{port}"
        return self

    async def stop(self) -> None:
        if self._server is not None:
            self._server.close()
            try:
                await self._server.wait_closed()
            except Exception:
                pass

    @staticmethod
    async def _read_header(reader: asyncio.StreamReader) -> str | None:
        """Read up to the end of the HTTP header block (CRLFCRLF). None on EOF/overflow."""
        try:
            data = await reader.readuntil(b"\r\n\r\n")
        except (asyncio.IncompleteReadError, asyncio.LimitOverrunError):
            return None
        return data.decode("latin-1")

    async def _handle(self, client_reader: asyncio.StreamReader, client_writer: asyncio.StreamWriter) -> None:
        up_writer: asyncio.StreamWriter | None = None
        try:
            header = await self._read_header(client_reader)
            if not header:
                return
            parts = header.split("\r\n", 1)[0].split(" ")
            if len(parts) < 2:
                return
            method, target = parts[0], parts[1]
            up_reader, up_writer = await asyncio.open_connection(self._host, self._port)
            if method.upper() == "CONNECT":
                # HTTPS: open an authenticated tunnel upstream, then relay raw bytes.
                up_writer.write(
                    f"CONNECT {target} HTTP/1.1\r\nHost: {target}\r\n{self._auth_header()}\r\n".encode())
                await up_writer.drain()
                up_header = await self._read_header(up_reader)
                status = up_header.split(" ", 2) if up_header else []
                if len(status) < 2 or status[1] != "200":
                    line = (up_header or "no response").split("\r\n", 1)[0]
                    log.warning("upstream refused CONNECT %s: %s", target, line)
                    client_writer.write(b"HTTP/1.1 502 Bad Gateway\r\nConnection: close\r\n\r\n")
                    await client_writer.drain()
                    return
                client_writer.write(b"HTTP/1.1 200 Connection established\r\n\r\n")
                await client_writer.drain()
            else:
                # Plain HTTP: re-inject the request upstream with auth, then relay.
                idx = header.index("\r\n") + 2
                up_writer.write((header[:idx] + self._auth_header() + header[idx:]).encode())
                await up_writer.drain()
            await self._relay(client_reader, client_writer, up_reader, up_writer)
        except Exception:
            pass  # one bad tunnel must never take down the listener
        finally:
            for w in (client_writer, up_writer):
                if w is not None:
                    try:
                        w.close()
                    except Exception:
                        pass

    @staticmethod
    async def _relay(
            client_reader: asyncio.StreamReader, client_writer: asyncio.StreamWriter,
            up_reader: asyncio.StreamReader, up_writer: asyncio.StreamWriter) -> None:
        # Pipe both directions, but tear the whole tunnel down as soon as EITHER side
        # closes (mirrors the .NET WhenAny). Waiting for both — as a plain gather does —
        # leaks a task holding two sockets on every half-closed connection, which piles
        # up fast across a long multi-worker run. Closing both writers when the first pipe
        # finishes unblocks the other's pending read so both tasks settle.
        async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
            try:
                while data := await reader.read(65536):
                    writer.write(data)
                    await writer.drain()
            except Exception:
                pass

        a = asyncio.create_task(pipe(client_reader, up_writer))
        b = asyncio.create_task(pipe(up_reader, client_writer))
        try:
            await asyncio.wait({a, b}, return_when=asyncio.FIRST_COMPLETED)
        finally:
            for w in (client_writer, up_writer):
                try:
                    w.close()
                except Exception:
                    pass
            await asyncio.gather(a, b, return_exceptions=True)

235
worker/blworker/runtime.py Normal file

@@ -0,0 +1,235 @@
"""The shared worker runtime — everything that's identical across market workers.

`Worker` is a template-method base: it owns the proxy/browser bring-up, the poll ->
scrape -> post loop, Cloudflare-driven IP rotation, result logging, and graceful
shutdown. A market worker subclasses it and fills in only what differs — how to dismiss
the consent banner, how to scrape one job, and how to describe a job in the log. The two
~300-line workers used to copy this whole loop verbatim.
"""

import asyncio
import json
import logging
import random
import signal
from abc import ABC, abstractmethod
from dataclasses import dataclass

import nodriver as uc

from .c2 import C2Client
from .config import Settings
from .proxy import LocalForwardingProxy, iproyal_password, new_session_id


@dataclass
class ScrapeResult:
    """What a single job scrape yields. `wire_bytes` is the metered (compressed) cost."""
    items: list
    pages: int
    reason: str
    wire_bytes: int = 0


def looks_like_challenge(body: str) -> bool:
    """True for an actual Cloudflare interstitial (or an empty body). Keyed on CF markers,
    NOT a leading '<' — a real market page IS html, so a startswith('<') check would flag
    every good page fetch as a challenge."""
    b = body or ""
    return not b.strip() or "Just a moment" in b or "challenge-platform" in b


async def page_fetch(page, url: str, accept: str = "application/json") -> tuple[int, str, int]:
    """Fetch in-page from the warm (Cloudflare-cleared) session and read back the Resource
    Timing transferSize — the actual compressed bytes the metered proxy bills (or -1 when
    cross-origin timing isn't exposed). Returns (status, body, wire_bytes). Use
    accept='text/html' for an SSR page payload, the default JSON for an API."""
    expr = (
        f"fetch({url!r}, {{credentials:'include', headers:{{'accept': {accept!r}}}}})"
        f".then(async r => {{"
        f" const body = await r.text();"
        f" const e = performance.getEntriesByName({url!r}).slice(-1)[0];"
        f" return JSON.stringify({{status: r.status, body: body, wire: e ? e.transferSize : -1}});"
        f"}}).catch(e => JSON.stringify({{status: -1, body: String(e), wire: -1}}))"
    )
    raw = await page.evaluate(expr, await_promise=True)
    if not isinstance(raw, str):
        return (-1, "", -1)
    try:
        obj = json.loads(raw)
        return (int(obj.get("status", -1)), obj.get("body", ""), int(obj.get("wire", -1)))
    except (json.JSONDecodeError, ValueError, TypeError):
        return (-1, raw, -1)


async def click(page, text: str, timeout: int = 3) -> bool:
    """Best-match click on visible text; swallow the not-found/timeout case."""
    try:
        el = await page.find(text, best_match=True, timeout=timeout)
        if el:
            await el.click()
            return True
    except Exception:
        pass
    return False


class Worker(ABC):
    # Per-market constants, set by the subclass.
    name: str = "worker"
    jobs_path: str = "/jobs"
    default_market_url: str = ""

    def __init__(self, settings: Settings):
        self.settings = settings
        self.market_url = settings.market_url or self.default_market_url
        self.c2 = C2Client(settings.c2_url, settings.token, self.jobs_path)
        self.log = logging.getLogger(self.name)
        self._forwarder: LocalForwardingProxy | None = None
        self._session_id: str | None = None
        self._stop = asyncio.Event()

    # --- hooks a market worker overrides ------------------------------------------
    @abstractmethod
    async def scrape_job(self, page, job) -> ScrapeResult:
        """Scrape ALL listings for one job and return them."""

    @abstractmethod
    def describe_job(self, job) -> str:
        """One-line job description for the log (e.g. the search term or slug)."""

    async def dismiss_consent(self, page) -> str | None:
        """Dismiss the cookie banner privacy-first; return a note, or None if absent.
        Default: nothing to do. Markets with a banner override this."""
        return None

    # --- shared machinery ---------------------------------------------------------
    def _iproyal_password(self, session_id: str) -> str:
        s = self.settings
        return iproyal_password(s.iproyal_password, s.iproyal_country, s.iproyal_lifetime_min, session_id)

    async def _pace(self, page) -> None:
        await page.sleep(self.settings.delay + random.uniform(0, self.settings.jitter))

    async def warm(self, page) -> None:
        """Open the market and clear Cloudflare so the session holds cf_clearance."""
        s = self.settings
        self.log.info("warming session at %s (clear Cloudflare; %ds)", self.market_url, s.solve_seconds)
        await page.get(self.market_url)
        await page.sleep(s.solve_seconds)
        note = await self.dismiss_consent(page)
        self.log.info("consent: %s", note or "left up")

    async def _setup_proxy(self) -> tuple[str | None, str]:
        """IPRoyal (auth'd, per-worker sticky IP) takes priority; else a plain auth-free
        PROXY; else this host's own IP. Returns (proxy_endpoint, human_label)."""
        s = self.settings
        if s.use_iproyal:
            self._session_id = new_session_id()
            self._forwarder = await LocalForwardingProxy(
                s.iproyal_host, s.iproyal_port, s.iproyal_username,
                self._iproyal_password(self._session_id)).start()
            label = f"iproyal[{s.iproyal_country or 'any'}] session {self._session_id} via {self._forwarder.endpoint}"
            return self._forwarder.endpoint, label
        return s.proxy, (s.proxy or "own IP")

    def _browser_args(self, proxy: str | None) -> list[str]:
        s = self.settings
        args = [f"--proxy-server={proxy}"] if proxy else []
        if not s.load_images:
            # Disable image loading at the engine level — the dominant bandwidth cost on
            # an image-heavy market, and unneeded for CF clearance or the JSON API.
            args.append("--blink-settings=imagesEnabled=false")
        if s.chrome_no_sandbox:
            # Required when running Chromium as root in a container.
            args += ["--no-sandbox", "--disable-dev-shm-usage"]
        return args

    async def _on_challenge(self, page) -> None:
        """The exit IP is likely flagged. On IPRoyal, rotate to a fresh sticky session
        (new IP) before re-warming; otherwise just re-solve in place."""
        if self._forwarder is not None:
            self._session_id = new_session_id()
            self._forwarder.set_password(self._iproyal_password(self._session_id))
            self.log.warning("challenged; rotating exit IP -> session %s, re-warming", self._session_id)
        else:
            self.log.warning("challenged; re-warming session")
        await self.warm(page)

    def _log_result(self, res: ScrapeResult, posted: dict | None, total_wire: int) -> None:
        if posted:
            summary = (f"matched {posted.get('matched')}, new {posted.get('inserted')}, "
                       f"upd {posted.get('updated')}, removed {posted.get('removed')}")
        else:
            summary = "post failed"
        self.log.info("scraped %d items (%dp, %s, %.0fKB wire) -> %s [lifetime %.1fMB]",
                      len(res.items), res.pages, res.reason, res.wire_bytes / 1024,
                      summary, total_wire / 1_048_576)

    def _install_signal_handlers(self) -> None:
        """Stop the loop on SIGINT/SIGTERM so `docker stop` shuts down cleanly. Not
        supported on Windows (ProactorEventLoop) — there Ctrl-C still raises
        KeyboardInterrupt, which the run loop's finally handles just as well."""
        try:
            loop = asyncio.get_running_loop()
            for sig in (signal.SIGINT, signal.SIGTERM):
                loop.add_signal_handler(sig, self._stop.set)
        except (NotImplementedError, AttributeError):
            pass

    async def _idle(self) -> None:
        """Sleep when the C2 has no work, but wake immediately on shutdown."""
        try:
            await asyncio.wait_for(self._stop.wait(), timeout=self.settings.idle_seconds)
        except asyncio.TimeoutError:
            pass

    async def run(self) -> None:
        self._install_signal_handlers()
        s = self.settings
        proxy, proxy_label = await self._setup_proxy()
        self.log.info("starting (C2=%s, proxy=%s, images=%s)",
                      s.c2_url, proxy_label, "on" if s.load_images else "off")
        browser = await uc.start(
            headless=False, browser_executable_path=s.browser_path,
            browser_args=self._browser_args(proxy))
        try:
            page = await browser.get("about:blank")
            await self.warm(page)
            total_wire = 0  # metered (compressed) bytes pulled, lifetime
            while not self._stop.is_set():
                job = await self.c2.get_job()
                if not job:
                    await self._idle()
                    continue
                self.log.info("job %s%s", job["jobId"][:8], self.describe_job(job))
                res = await self.scrape_job(page, job)
                total_wire += res.wire_bytes
                if res.reason == "challenged":
                    await self._on_challenge(page)
                posted = await self.c2.post_result(job["jobId"], {
                    "items": res.items, "pages": res.pages, "stoppedReason": res.reason})
                self._log_result(res, posted, total_wire)
                await self._pace(page)
        finally:
            self.log.info("shutting down")
            browser.stop()
            if self._forwarder is not None:
                await self._forwarder.stop()


def run(worker_cls: type[Worker]) -> None:
    """Boot a worker from the environment: parse config, set up logging, run the loop on
    nodriver's event loop. The thin market scripts call this and nothing else."""
    from . import log as log_setup
    settings = Settings.from_env()
    log_setup.configure(settings.log_level, settings.log_json)
    uc.loop().run_until_complete(worker_cls(settings).run())

129
worker/csmoney_worker.py Normal file

@@ -0,0 +1,129 @@
"""cs.money scrape worker (pull model).
A thin strategy over blworker.Worker: it supplies only the cs.money-specific bits — the
consent banner steps and how to scrape one skin+wear's sell-orders. The warm session, the
poll/scrape/post loop, the IPRoyal proxy and IP rotation, logging and shutdown all live in
the shared runtime. Env knobs are documented in worker/README.md.
cs.money is an Astro SSR app: the free-text market search filters server-side and the
resulting listings are embedded in the page as a __page-params JSON blob. The
/2.0/market/sell-orders API rejects a `search` param (HTTP 400), so we fetch the PAGE for
a search and read the embedded items — same item shape as the API.
A page returns at most 60 and offset is ignored, so we paginate with a FORWARD CURSOR on
float: cs.money honors `order=asc&sort=float` + `minFloat`, and float is full-precision and
effectively unique per item. We grab the 60 lowest-float items at/above `lo`, advance `lo`
to the highest float returned, and repeat until a page is under the cap. (The old
minPrice/maxPrice bisection silently truncated cheap skins: >60 listings can share a
sub-$0.02 reference band, which no price window can split — floats almost never tie, so the
cursor always makes progress.)
cd worker
.venv\\Scripts\\Activate.ps1
pip install -r requirements.txt
python csmoney_worker.py
"""
import json
import re
import urllib.parse
from blworker import ScrapeResult, Worker, click, page_fetch, run
PAGE = ("https://cs.money/market/buy/?search={search}"
"&order=asc&sort=float&minFloat={lo:.12f}&maxFloat=1")
PAGE_CAP = 60 # items per SSR page
PAGE_PARAMS_RE = re.compile(
r'<script\b[^>]*id="__page-params"[^>]*>(.*?)</script>', re.S)
def extract_items(html: str) -> list:
"""Pull inventory.items out of the page's __page-params JSON blob."""
m = PAGE_PARAMS_RE.search(html)
if not m:
return []
try:
return json.loads(m.group(1)).get("inventory", {}).get("items", []) or []
except json.JSONDecodeError:
return []
class CsMoneyWorker(Worker):
name = "csmoney"
jobs_path = "/jobs"
default_market_url = "https://cs.money/market/buy/"
def describe_job(self, job) -> str:
return f"search {job['search']!r}"
async def dismiss_consent(self, page) -> str | None:
"""Privacy-preserving. The banner only offers 'Accept all' / 'Manage cookies';
the Reject-all control lives inside the Manage window. So: Manage -> Reject all ->
Confirm. (The data path reads SSR __page-params regardless, but this keeps the
session honest and unblocks any future interaction.)"""
        steps = []
        if await click(page, "Manage cookies") or await click(page, "Manage"):
            await page.sleep(1)
        if await click(page, "Reject all"):
            steps.append("reject-all")
        for c in ("Confirm my choice", "Confirm", "Save"):
            if await click(page, c):
                steps.append(f"confirm:{c}")
                break
        return ", ".join(steps) if steps else None

    async def scrape_job(self, page, job) -> ScrapeResult:
        """Scrape ALL listings for one skin+wear via a forward float cursor.
        Grab the 60 lowest-float items at/above `lo`, advance `lo` to the highest float on
        the page, repeat until a page is under the cap. The boundary item is re-fetched
        (minFloat is inclusive) and dropped by the id dedup."""
        search = urllib.parse.quote_plus(job["search"])
        max_fetches = job.get("maxPages", 40)  # safety cap on page fetches per job
        seen: dict = {}
        fetches = 0
        wire = 0
        lo = 0.0
        reason = "completed"
        while fetches < max_fetches:
            _status, body, wbytes = await page_fetch(page, PAGE.format(search=search, lo=lo))
            fetches += 1
            if wbytes > 0:
                wire += wbytes
            if "Just a moment" in body or "challenge-platform" in body:
                return ScrapeResult(list(seen.values()), fetches, "challenged", wire)
            items = extract_items(body)
            floats = []
            for it in items:
                if it.get("id") is not None:
                    seen[it["id"]] = it
                fl = (it.get("asset") or {}).get("float")
                if isinstance(fl, (int, float)):
                    floats.append(fl)
            if len(items) < PAGE_CAP:
                break  # last page: fewer than the cap means we've seen everything
            # Advance the cursor past the highest float on this page. Items at exactly that
            # float are re-fetched next round (minFloat is inclusive) and deduped by id.
            nxt = max(floats) if floats else None
            if nxt is None or nxt <= lo:
                # Cursor can't advance: a full page (60 or more listings) shares a single
                # float value, or the items carry no float. Bail loudly rather than spin:
                # a flagged gap beats a silent one (this is the failure the price-window
                # version hid).
                reason = "stuck-float-tie"
                break
            lo = nxt
            await self._pace(page)
        else:
            reason = "fetch-cap"
        return ScrapeResult(list(seen.values()), fetches, reason, wire)


if __name__ == "__main__":
    run(CsMoneyWorker)
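
A minimal, self-contained sketch of the forward float cursor that `scrape_job` walks, run against an in-memory list instead of cs.money. `fetch_page`, `sweep`, and the sample listings are hypothetical stand-ins, not the worker's real helpers; only the loop shape (inclusive lower bound, id dedup, tie bail-out, fetch cap) mirrors the method above.

```python
PAGE_CAP = 60  # assumed page size, matching the "60 lowest-float items" in the docstring


def fetch_page(listings, lo, cap=PAGE_CAP):
    """Stand-in for the market: the `cap` lowest-float listings with float >= lo."""
    eligible = sorted((it for it in listings if it["float"] >= lo),
                      key=lambda it: it["float"])
    return eligible[:cap]


def sweep(listings, max_fetches=40):
    seen = {}          # id -> listing; re-fetched boundary items collapse here
    lo = 0.0
    fetches = 0
    reason = "completed"
    while fetches < max_fetches:
        items = fetch_page(listings, lo)
        fetches += 1
        for it in items:
            seen[it["id"]] = it
        if len(items) < PAGE_CAP:
            break      # short page: nothing left above the cursor
        nxt = max(it["float"] for it in items)
        if nxt <= lo:
            reason = "stuck-float-tie"   # a full page shares one float value
            break
        lo = nxt       # inclusive bound, so the boundary item comes back once more
    else:
        reason = "fetch-cap"
    return list(seen.values()), fetches, reason


# 150 listings with distinct floats: three fetches recover all of them.
listings = [{"id": i, "float": i / 1000} for i in range(150)]
items, fetches, reason = sweep(listings)
print(reason, fetches, len(items))

# 70 listings sharing one float: the cursor cannot advance, so the sweep flags it.
tied_items, tied_fetches, tied_reason = sweep([{"id": i, "float": 0.5} for i in range(70)])
print(tied_reason, tied_fetches, len(tied_items))
```

The second run shows why the tie is reported rather than retried: with an inclusive bound, every further request would return the same 60 items.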


@@ -1,71 +0,0 @@
"""
Diagnose the cs.money cookie-consent banner so we can dismiss it programmatically.
It's likely a Shadow DOM web component (CookieConsentSystem), which is why
document.querySelectorAll-based clicks miss the real buttons.
Saves:
captures/_consent.png - screenshot (so we can SEE the banner + button positions)
captures/_consent.txt - shadow-host tags + every consent-like button found by
piercing shadow roots, with center coordinates.
cd worker; .venv\\Scripts\\Activate.ps1
python diag_consent.py
"""
import json
import os
import pathlib
import nodriver as uc
URL = os.environ.get("URL", "https://cs.money/market/buy/?search=ak-47+redline")
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
BROWSER_PATH = os.environ.get("BROWSER_PATH")
OUT = pathlib.Path(__file__).parent / "captures"
# Pierce shadow roots to find consent buttons + their viewport-center coords.
DEEP_FIND = r"""
JSON.stringify((()=>{
  const hits=[], hosts=[];
  function walk(root){
    root.querySelectorAll('*').forEach(e=>{
      if(e.shadowRoot){ hosts.push(e.tagName.toLowerCase()); walk(e.shadowRoot); }
      const t=(e.textContent||'').trim();
      if(t.length<40 && /accept all|manage cookies|reject all|confirm my choice|^accept$|^manage$/i.test(t)){
        const r=e.getBoundingClientRect();
        if(r.width>0&&r.height>0)
          hits.push({tag:e.tagName, text:t, x:Math.round(r.x+r.width/2), y:Math.round(r.y+r.height/2)});
      }
    });
  }
  walk(document);
  return {shadowHosts:[...new Set(hosts)], buttons:hits};
})())
"""


async def main():
    OUT.mkdir(exist_ok=True)
    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH)
    try:
        page = await browser.get(URL)
        print(f"Loaded {URL}; waiting {SOLVE_SECONDS}s for Cloudflare...")
        await page.sleep(SOLVE_SECONDS)
        png = str(OUT / "_consent.png")
        await page.save_screenshot(png)
        print(f"screenshot -> {png}")
        raw = await page.evaluate(DEEP_FIND)
        info = json.loads(raw) if isinstance(raw, str) else {"error": repr(raw)}
        (OUT / "_consent.txt").write_text(json.dumps(info, indent=2), encoding="utf-8")
        print("shadow hosts:", info.get("shadowHosts"))
        print("consent buttons found:")
        for b in info.get("buttons", []):
            print(f" {b}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())


@@ -1,183 +0,0 @@
"""
Discover how cs.money paginates a filtered search past the initial ~60 SSR items.
Tests two hypotheses against a high-result search (default "ak-47 redline", which has
well over 60 listings):
A. Does the SSR page honor offset/limit in the URL? Fetch ?search=...&offset=60 and
?search=...&limit=120 and compare item ids to page 1. If disjoint/larger, we can
paginate cheaply by re-fetching the page.
B. The real client "load more": scroll hard to trigger lazy-load and capture any
cs.money /2.0/ XHR via Resource Timing — that request carries the structured
filter params + offset, i.e. a lighter direct-API pagination path.
Findings are printed and saved to captures/_pagination.txt.
cd worker; .venv\\Scripts\\Activate.ps1
python discover_pagination.py
$env:SEARCH="ak-47 redline"; python discover_pagination.py # override the search
"""
import json
import os
import pathlib
import re
import nodriver as uc
from nodriver import cdp
SEARCH = os.environ.get("SEARCH", "ak-47 redline")
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
BROWSER_PATH = os.environ.get("BROWSER_PATH")
PROXY = os.environ.get("PROXY")
BASE = "https://cs.money/market/buy/"
PAGE_PARAMS_RE = re.compile(r'<script\b[^>]*id="__page-params"[^>]*>(.*?)</script>', re.S)
OUT = pathlib.Path(__file__).parent / "captures"
CONSENT = ["Reject all", "Only necessary", "Reject", "Decline", "Deny"]
# Aggressive scroll: window + every scrollable container (the grid scrolls in a div,
# which is why a plain window.scrollTo didn't trigger lazy-load before).
SCROLL_JS = (
    "window.scrollTo(0, document.body.scrollHeight);"
    "document.querySelectorAll('*').forEach(e=>{"
    " if (e.scrollHeight > e.clientHeight + 80) e.scrollTop = e.scrollHeight;});")


async def js(page, expr):
    raw = await page.evaluate(f"JSON.stringify({expr})")
    try:
        return json.loads(raw) if isinstance(raw, str) else None
    except (json.JSONDecodeError, TypeError):
        return None


async def fetch_text(page, url):
    expr = (f"fetch({url!r},{{credentials:'include'}}).then(async r=>"
            f"JSON.stringify({{status:r.status, body:await r.text()}}))")
    raw = await page.evaluate(expr, await_promise=True)
    try:
        o = json.loads(raw)
        return o.get("status"), o.get("body", "")
    except (json.JSONDecodeError, TypeError):
        return None, ""


def page_item_ids(html):
    m = PAGE_PARAMS_RE.search(html or "")
    if not m:
        return []
    try:
        return [it.get("id") for it in json.loads(m.group(1)).get("inventory", {}).get("items", [])]
    except json.JSONDecodeError:
        return []


async def click_visible(page, pattern):
    """Click the first VISIBLE element whose trimmed text matches `pattern` (case-
    insensitive). nodriver's find() was matching hidden/duplicate nodes; restricting
    to offsetParent!=null + short text hits the real button."""
    expr = ("JSON.stringify((()=>{"
            "const re=new RegExp(" + json.dumps(pattern) + ",'i');"
            "const els=[...document.querySelectorAll('button,a,[role=\"button\"],span,div')];"
            "const b=els.find(e=>e.offsetParent!==null && (e.textContent||'').trim().length<40 "
            "&& re.test((e.textContent||'').trim()));"
            "if(b){b.click();return true}return false})())")
    r = await page.evaluate(expr)
    return isinstance(r, str) and "true" in r


async def banner_present(page):
    r = await page.evaluate(
        "JSON.stringify(/Manage cookies|Accept all/i.test(document.body.innerText||''))")
    return isinstance(r, str) and "true" in r


async def dismiss(page):
    """Privacy-preserving first (Manage -> Reject all -> Confirm); if the banner is
    still up, fall back to Accept all so the page becomes interactive (discovery
    needs scrolling to work)."""
    steps = []
    if await click_visible(page, "manage cookies|^manage$"):
        steps.append("manage")
        await page.sleep(1.2)
    if await click_visible(page, "reject all"):
        steps.append("reject-all")
        await page.sleep(0.4)
    for c in ("confirm my choice", "^confirm$", "^save$"):
        if await click_visible(page, c):
            steps.append("confirm")
            break
    await page.sleep(1)
    if await banner_present(page):
        steps.append("still-up->accept" if await click_visible(page, "accept all|^accept$") else "still-up")
        await page.sleep(0.5)
    steps.append("gone" if not await banner_present(page) else "STILL-PRESENT")
    return ", ".join(steps)
async def main():
    OUT.mkdir(exist_ok=True)
    args = [f"--proxy-server={PROXY}"] if PROXY else []
    args.append("--blink-settings=imagesEnabled=false")
    from urllib.parse import quote_plus
    q = quote_plus(SEARCH)
    findings = []
    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH, browser_args=args)
    try:
        url0 = f"{BASE}?search={q}"
        page = await browser.get(url0)
        print(f"Warming on {url0} ({SOLVE_SECONDS}s for Cloudflare)...")
        await page.sleep(SOLVE_SECONDS)
        print(f"Consent: {await dismiss(page)}")

        # --- A. URL offset/limit on the SSR page ---
        _, h0 = await fetch_text(page, f"{BASE}?search={q}")
        _, h1 = await fetch_text(page, f"{BASE}?search={q}&offset=60")
        _, h2 = await fetch_text(page, f"{BASE}?search={q}&limit=120")
        a, b, c = page_item_ids(h0), page_item_ids(h1), page_item_ids(h2)
        overlap = len(set(a) & set(b))
        findings.append(f"page1 ids={len(a)} offset=60 ids={len(b)} (overlap with page1={overlap}) limit=120 ids={len(c)}")
        findings.append(f" -> offset works? {'YES (disjoint)' if b and overlap == 0 else 'no/ignored'}")
        findings.append(f" -> limit works? {'YES (>60)' if len(c) > 60 else 'no/ignored'}")

        # --- B. Trigger client load-more, capture cs.money /2.0/ XHRs ---
        # Infinite scroll only fires on GRADUAL downward scrolling; jumping to the
        # bottom skips the trigger. So step down in small wheel increments and watch
        # the item count grow.
        before = set(await js(page, "performance.getEntriesByType('resource').map(e=>e.name)") or [])

        async def card_count():
            n = await page.evaluate(
                "JSON.stringify(document.querySelectorAll('[href*=\"/item/\"],[class*=\"item\" i]').length)")
            return n

        print(f" cards before scroll: {await card_count()}")
        for step in range(60):
            try:
                await page.send(cdp.input_.dispatch_mouse_event(
                    type_="mouseWheel", x=720, y=450, delta_x=0, delta_y=500))
            except Exception:
                pass
            await page.sleep(0.7)
            if step % 15 == 14:
                now = [u for u in (await js(page, "performance.getEntriesByType('resource').map(e=>e.name)") or [])
                       if u not in before and "cs.money" in u and "metrics." not in u and "traces." not in u]
                print(f" step {step+1}: cards={await card_count()} new cs.money reqs={len(now)}")
        after = await js(page, "performance.getEntriesByType('resource').map(e=>e.name)") or []
        new_xhrs = [u for u in after if u not in before and "cs.money" in u
                    and "metrics." not in u and "traces." not in u]
        findings.append(f"\nclient requests after scrolling ({len(new_xhrs)} new cs.money):")
        findings.extend(f" {u}" for u in dict.fromkeys(new_xhrs))
        if not new_xhrs:
            findings.append(" (none: grid may not lazy-load via XHR, or scroll didn't reach the trigger)")
        report = "\n".join(findings)
        print("\n=== FINDINGS ===\n" + report)
        (OUT / "_pagination.txt").write_text(f"search: {SEARCH}\n\n{report}\n", encoding="utf-8")
        print(f"\nsaved to {OUT / '_pagination.txt'}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())


@@ -1,96 +0,0 @@
"""
Find cs.money's price-filter URL param (the basis for price-bucket pagination).
The market has a Price from/to filter in the sidebar. `search=` works via the URL and
the page SSRs the filtered listings into __page-params, so a price param likely works
the same way. We baseline the cheapest set, then try candidate param names with a high
floor and check whether the returned listings actually shift above it.
cd worker; .venv\\Scripts\\Activate.ps1
python discover_price_param.py
"""
import json
import os
import pathlib
import re
from urllib.parse import quote_plus
import nodriver as uc
SEARCH = os.environ.get("SEARCH", "ak-47 redline")
FLOOR = float(os.environ.get("FLOOR", "200"))
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
BROWSER_PATH = os.environ.get("BROWSER_PATH")
BASE = "https://cs.money/market/buy/"
PP = re.compile(r'<script\b[^>]*id="__page-params"[^>]*>(.*?)</script>', re.S)
OUT = pathlib.Path(__file__).parent / "captures"
# Param-name variants for a price floor (and a couple of from/to pairs).
CANDIDATES = [
    "minPrice", "priceFrom", "price_from", "priceMin", "min_price",
    "priceGte", "from", "price_min", "minprice", "price.gte", "pricegte",
]


async def fetch_prices(page, url):
    expr = (f"fetch({url!r},{{credentials:'include'}}).then(async r=>"
            f"JSON.stringify({{status:r.status, body:await r.text()}}))")
    raw = await page.evaluate(expr, await_promise=True)
    try:
        body = json.loads(raw).get("body", "")
    except (json.JSONDecodeError, TypeError):
        return None
    m = PP.search(body or "")
    if not m:
        return None
    try:
        items = json.loads(m.group(1)).get("inventory", {}).get("items", [])
    except json.JSONDecodeError:
        return None
    return [it.get("pricing", {}) for it in items if it.get("pricing")]
async def main():
    OUT.mkdir(exist_ok=True)
    q = quote_plus(SEARCH)
    lines = []
    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH,
                             browser_args=["--blink-settings=imagesEnabled=false"])
    try:
        page = await browser.get(f"{BASE}?search={q}")
        print(f"Warming ({SOLVE_SECONDS}s)...")
        await page.sleep(SOLVE_SECONDS)
        # Test minPrice/maxPrice semantics directly (old cs.money API used these).
        tests = [
            ("baseline", f"{BASE}?search={q}"),
            ("maxPrice=200", f"{BASE}?search={q}&maxPrice=200"),
            ("minPrice=300", f"{BASE}?search={q}&minPrice=300"),
            ("minPrice=300&maxPrice=400", f"{BASE}?search={q}&minPrice=300&maxPrice=400"),
            ("minPrice=500&maxPrice=1000", f"{BASE}?search={q}&minPrice=500&maxPrice=1000"),
        ]

        def rng(pr, field):
            vals = [p.get(field) for p in pr if isinstance(p.get(field), (int, float))]
            return (min(vals), max(vals)) if vals else (None, None)

        def span(lo, hi):
            # rng() returns (None, None) when no item carries the field; formatting
            # None with :.2f would raise TypeError, so print a placeholder instead.
            return f"[{lo:.2f},{hi:.2f}]" if lo is not None else "[-]"

        for name, url in tests:
            pr = await fetch_prices(page, url)
            if not pr:
                lines.append(f"{name:28} -> no items")
            else:
                d, c, b = rng(pr, "default"), rng(pr, "computed"), rng(pr, "basePrice")
                lines.append(f"{name:28} -> n={len(pr)} default{span(*d)} "
                             f"computed{span(*c)} base{span(*b)}")
            print(lines[-1])
        (OUT / "_price_param.txt").write_text(
            f"search={SEARCH} floor={FLOOR}\n\n" + "\n".join(lines), encoding="utf-8")
        print(f"\nsaved to {OUT/'_price_param.txt'}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())


@@ -15,5 +15,6 @@ x11vnc -display "${DISPLAY_NUM}" -forever -shared -nopw -quiet -bg
echo "[entrypoint] starting noVNC on :6080 (open http://localhost:6080/vnc.html)"
websockify --web=/usr/share/novnc 6080 localhost:5900 &
echo "[entrypoint] launching worker"
exec python worker.py
WORKER_SCRIPT="${WORKER_SCRIPT:-csmoney_worker.py}"
echo "[entrypoint] launching ${WORKER_SCRIPT}"
exec python "${WORKER_SCRIPT}"


@@ -1,285 +0,0 @@
"""
Proof-of-concept / pre-fleet validation for the cs.money scraper.
Proves the things we need before building the C2 + worker fleet:
1. nodriver clears cs.money's Cloudflare where .NET Selenium couldn't.
2. a single WARM session can page the sell-orders API deeply without re-challenge.
3. a free-text market search (e.g. "cyber security ft") can be turned into a
filtered sell-orders API call — we DISCOVER the real API params by capturing the
request the page itself fires, instead of guessing.
It opens the market (optionally a search URL) in a real non-headless Chromium, lets
you clear Cloudflare, dismisses the cookie banner (privacy-preserving), captures the
sell-orders request the page makes, then pages that API from inside the cleared page
(same-origin fetch carries cf_clearance), pacing itself and stopping on re-challenge.
cd worker
.venv\\Scripts\\Activate.ps1
pip install -r requirements.txt
python poc.py # whole-market sweep
$env:SEARCH="cyber security ft"; python poc.py # targeted: FT M4A4 Cyber Security
Env knobs (all optional):
SEARCH free-text market search; when set, scrape only those results
MARKET_URL market page base (default the buy market)
SOLVE_SECONDS seconds to wait for you to clear Cloudflare (default 30)
PAGES how many offset pages (60 each) to attempt (default 20)
START_OFFSET first offset (default 0)
DELAY / JITTER base + random seconds between fetches (default 2.0 / 1.5)
PROXY host:port for an auth-free proxy (omit to use your own IP)
BROWSER_PATH path to Chrome/Edge if auto-detect fails
"""
import json
import os
import pathlib
import random
from urllib.parse import quote_plus, urlsplit, parse_qsl, urlencode, urlunsplit
import nodriver as uc
from nodriver import cdp
SEARCH = os.environ.get("SEARCH")
MARKET_URL = os.environ.get("MARKET_URL", "https://cs.money/market/buy/")
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
PAGES = int(os.environ.get("PAGES", "20"))
START_OFFSET = int(os.environ.get("START_OFFSET", "0"))
DELAY = float(os.environ.get("DELAY", "2.0"))
JITTER = float(os.environ.get("JITTER", "1.5"))
PROXY = os.environ.get("PROXY")
BROWSER_PATH = os.environ.get("BROWSER_PATH")
# Fallback template if we fail to capture the page's own request (offset = {}).
DEFAULT_TEMPLATE = "https://cs.money/2.0/market/sell-orders?limit=60&offset={}"
OUT_DIR = pathlib.Path(__file__).parent / "captures"
CONSENT_LABELS = ["Reject all", "Reject All", "Only necessary", "Necessary only",
                  "Reject", "Decline", "Deny"]
# Filled by the CDP network handler with sell-orders request URLs the page fires.
_seen_urls: list[str] = []


def looks_like_challenge(body: str) -> bool:
    s = (body or "").lstrip()
    return not s or s.startswith("<") or "Just a moment" in body or "challenge-platform" in body


def decimals(v: float) -> int:
    r = repr(float(v))
    return len(r.split(".")[-1]) if "." in r else 0


def template_from(url: str) -> str:
    """Turn a captured sell-orders URL into a template with offset as '{}',
    preserving every other param (the search/filter encoding we want to learn)."""
    parts = urlsplit(url)
    q = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "offset"]
    if not any(k == "limit" for k, _ in q):
        q.append(("limit", "60"))
    base_q = urlencode(q)
    new_q = (base_q + "&" if base_q else "") + "offset={}"
    return urlunsplit((parts.scheme, parts.netloc, parts.path, new_q, ""))


async def dismiss_consent(page) -> str | None:
    """Best-effort and privacy-preserving: never clicks 'Accept all'."""
    for label in CONSENT_LABELS:
        try:
            el = await page.find(label, best_match=True, timeout=2)
        except Exception:
            el = None
        if el:
            try:
                await el.click()
                return label
            except Exception:
                pass
    return None


async def fetch_json(page, url: str) -> tuple[str, str]:
    expr = (
        f"fetch({url!r}, {{credentials:'include', headers:{{'accept':'application/json'}}}})"
        f".then(async r => JSON.stringify({{status: r.status, body: await r.text()}}))"
    )
    raw = await page.evaluate(expr, await_promise=True)
    if not isinstance(raw, str):
        return ("-1", "")
    try:
        obj = json.loads(raw)
        return (str(obj.get("status", "-1")), obj.get("body", ""))
    except json.JSONDecodeError:
        return ("-1", raw)
async def main():
    OUT_DIR.mkdir(exist_ok=True)
    args = [f"--proxy-server={PROXY}"] if PROXY else []
    target_url = MARKET_URL
    tag = "market"
    if SEARCH:
        sep = "&" if "?" in MARKET_URL else "?"
        target_url = f"{MARKET_URL}{sep}search={quote_plus(SEARCH)}"
        tag = "search_" + "".join(c if c.isalnum() else "_" for c in SEARCH)[:40]
    print(f"Launching nodriver Chromium (proxy={PROXY or 'none / own IP'})...")
    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH, browser_args=args)
    pages_ok = items_total = floats_total = low_prec = 0
    dp_min, dp_max = 99, 0
    deepest_offset = None
    reason = "completed (hit PAGES limit)"
    try:
        # Open a blank tab first so the network handler is attached BEFORE the page
        # fires its filtered sell-orders request (otherwise we'd miss it).
        page = await browser.get("about:blank")

        async def on_request(evt):
            url = evt.request.url
            if "/market/sell-orders" in url:
                _seen_urls.append(url)

        page.add_handler(cdp.network.RequestWillBeSent, on_request)
        try:
            await page.send(cdp.network.enable())
        except Exception as ex:
            print(f"(network capture unavailable: {ex})")
        print(f"Opening {target_url}")
        await page.get(target_url)
        print(f"Solve any Cloudflare challenge. Waiting {SOLVE_SECONDS}s for the grid...")
        await page.sleep(SOLVE_SECONDS)
        clicked = await dismiss_consent(page)
        print(f"Consent banner: {'dismissed via ' + clicked if clicked else 'left up (does not block fetch)'}")
        # Reliable discovery via the Resource Timing API: the browser records EVERY
        # request the page made, so we read the real sell-orders URL straight out of it
        # (no flaky CDP event timing). Also dump nearby API calls for context.
        # cs.money is an Astro SSR app: the initial filtered listings are rendered
        # server-side (no client XHR to capture). Scroll to provoke lazy-load
        # pagination, which DOES fire a client request carrying the real filter params.
        print("Scrolling to trigger lazy-load pagination...")
        for _ in range(6):
            try:
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            except Exception:
                pass
            await page.sleep(2)

        # nodriver returns arrays unreliably from evaluate(), so JSON.stringify in JS
        # and json.loads here (the string path is proven by fetch_json).
        async def js_list(expr: str) -> list:
            raw = await page.evaluate(f"JSON.stringify({expr})")
            try:
                return json.loads(raw) if isinstance(raw, str) else []
            except (json.JSONDecodeError, TypeError):
                return []
        try:
            all_urls = await js_list("performance.getEntriesByType('resource').map(e=>e.name)")
            print(f">>> Resource Timing saw {len(all_urls)} requests total")
            if all_urls:
                (OUT_DIR / "_all_requests.txt").write_text(
                    "\n".join(dict.fromkeys(all_urls)), encoding="utf-8")
            sell = [u for u in all_urls if "/market/sell-orders" in u]
            _seen_urls.extend(sell)
            api = [u for u in all_urls if "cs.money/" in u and ("/2.0/" in u or "/1.0/" in u)]
            if api:
                (OUT_DIR / "_api_calls.txt").write_text("\n".join(dict.fromkeys(api)), encoding="utf-8")
                print(f">>> {len(set(api))} cs.money API calls; saved to {OUT_DIR / '_api_calls.txt'}")
        except Exception as ex:
            print(f"(resource-timing query failed: {ex})")
        # Dump the SSR'd page so we can see how the filter is encoded and where the
        # listings data lives (Astro embeds island props / hydration JSON in the HTML).
        try:
            html = await page.evaluate("document.documentElement.outerHTML")
            if isinstance(html, str) and html:
                (OUT_DIR / "_page.html").write_text(html, encoding="utf-8")
                print(f">>> saved page HTML ({len(html)} bytes) to {OUT_DIR / '_page.html'}")
        except Exception as ex:
            print(f"(page HTML dump failed: {ex})")
        # Discovery: what sell-orders request did the page actually make?
        if _seen_urls:
            captured = _seen_urls[-1]
            template = template_from(captured)
            print("\n>>> DISCOVERED sell-orders API call the page fired:")
            print(f" {captured}")
            print(f">>> pagination template: {template}\n")
            # Persist it: the console line is easy to lose, and this is the one bit
            # of ground truth (the real filter-param scheme) we need.
            (OUT_DIR / "_discovered.txt").write_text(
                "ALL captured sell-orders requests:\n"
                + "\n".join(dict.fromkeys(_seen_urls))
                + f"\n\npagination template:\n{template}\n",
                encoding="utf-8")
            print(f">>> saved to {OUT_DIR / '_discovered.txt'}")
        else:
            template = DEFAULT_TEMPLATE
            if SEARCH:
                template = template.replace("offset={}", f"search={quote_plus(SEARCH)}&offset={{}}")
            print(f"\n(no request captured; falling back to template: {template})\n")
        for i in range(PAGES):
            offset = START_OFFSET + i * 60
            status, body = await fetch_json(page, template.format(offset))
            if looks_like_challenge(body):
                print(f" page {i + 1} [offset {offset}]: RE-CHALLENGED (status {status}). Stopping.")
                (OUT_DIR / f"{tag}_challenge_offset_{offset}.html").write_text(body, encoding="utf-8")
                reason = f"re-challenged at offset {offset}"
                break
            try:
                items = json.loads(body).get("items", [])
            except json.JSONDecodeError:
                print(f" page {i + 1} [offset {offset}]: non-JSON (status {status}). Stopping.")
                reason = f"non-JSON at offset {offset}"
                break
            if not items:
                print(f" page {i + 1} [offset {offset}]: 0 items, end of results.")
                reason = "end of results"
                break
            (OUT_DIR / f"{tag}_offset_{offset:06d}.json").write_text(body, encoding="utf-8")
            pages_ok += 1
            deepest_offset = offset
            items_total += len(items)
            names = set()
            for it in items:
                fl = it.get("asset", {}).get("float")
                if fl is not None:
                    floats_total += 1
                    d = decimals(fl)
                    dp_min, dp_max = min(dp_min, d), max(dp_max, d)
                    if d <= 6:  # short repr: exact binary fraction (e.g. 1/16), not truncation
                        low_prec += 1
                names.add(it.get("asset", {}).get("names", {}).get("full"))
            sample = next(iter(names), None) if SEARCH else None
            print(f" page {i + 1} [offset {offset}] OK: {len(items)} items"
                  + (f" (e.g. {sample}; {len(names)} distinct names)" if SEARCH else ""))
            await page.sleep(DELAY + random.uniform(0, JITTER))
        print("\n=== summary ===")
        print(f" query: {SEARCH or '(whole market)'}")
        print(f" stopped: {reason}")
        print(f" clean pages: {pages_ok} deepest offset: {deepest_offset} items: {items_total}")
        if floats_total:
            # Truncation would make MANY values short, not one exact binary fraction.
            verdict = "FULL precision" if low_prec / floats_total < 0.02 else "POSSIBLE TRUNCATION"
            print(f" floats: {floats_total} items, {dp_max}-decimal max, "
                  f"{low_prec} short-repr (exact fractions): {verdict}")
        print(f" files in {OUT_DIR}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())


@@ -1,77 +0,0 @@
"""
Probe which extra filter params cs.money's SSR market search honors, so we can
pick a SECOND pagination axis to break apart dense price bands that saturate the
60-cap (see diag_windows.py). For a saturating search we try candidate params and
report how the returned set's size + float range + price range change.
python probe_filters.py "Glock-18 Candy Apple mw"
"""
import asyncio
import sys
import nodriver as uc
import worker
BASE = "https://cs.money/market/buy/?search={q}"
# (label, extra query string) — candidates cs.money markets commonly expose.
CANDIDATES = [
    ("baseline", ""),
    ("sort=price asc", "&order=asc&sort=price"),
    ("sort=price desc", "&order=desc&sort=price"),
    ("sort=float", "&sort=float"),
    ("minFloat/maxFloat lo", "&minFloat=0.07&maxFloat=0.10"),
    ("minFloat/maxFloat hi", "&minFloat=0.10&maxFloat=0.15"),
    ("maxWear lo", "&minWear=0.07&maxWear=0.10"),
    ("isStatTrak=true", "&isStatTrak=true"),
    ("hasStickers=false", "&hasStickers=false"),
]


def stats(items):
    floats = [(((it.get("asset") or {}).get("float"))) for it in items]
    floats = [f for f in floats if isinstance(f, (int, float))]
    bases = []
    for it in items:
        p = it.get("pricing") or {}
        b = p.get("basePrice", p.get("computed"))
        if isinstance(b, (int, float)):
            bases.append(b)
    fr = f"[{min(floats):.4f},{max(floats):.4f}]" if floats else "[-]"
    br = f"[{min(bases):.2f},{max(bases):.2f}]" if bases else "[-]"
    return f"n={len(items):3d} float{fr} base{br}"
async def main():
    search = " ".join(sys.argv[1:]) or "Glock-18 Candy Apple mw"
    q = worker.urllib.parse.quote_plus(search)
    args = ["--blink-settings=imagesEnabled=false"]
    browser = await uc.start(headless=False, browser_args=args)
    try:
        page = await browser.get("about:blank")
        await worker.warm(page)
        base_ids = None
        for label, extra in CANDIDATES:
            url = BASE.format(q=q) + extra
            status, body = await worker.fetch_json(page, url)
            if "Just a moment" in body or "challenge-platform" in body:
                print(f" {label:24s} CHALLENGED")
                break
            items = worker.extract_items(body)
            ids = {it.get("id") for it in items}
            if label == "baseline":
                base_ids = ids
                delta = ""
            else:
                # If a param is IGNORED, the set is identical to baseline.
                delta = "IGNORED (== baseline)" if ids == base_ids else f"CHANGED ({len(ids ^ (base_ids or set()))} diff ids)"
            print(f" {label:24s} {stats(items)} {delta}")
            await page.sleep(worker.DELAY)
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())


@@ -1,5 +1,9 @@
# cs.money scraping worker.
# Market scraping workers (cs.money, skin.land).
# nodriver = the modern successor to undetected-chromedriver: it drives a normal
# Chromium over CDP directly (no chromedriver, so none of the cdc_/webdriver tells
# that got our .NET Selenium setup insta-challenged by Cloudflare).
nodriver>=0.39
#
# Everything else the workers use is the Python stdlib (asyncio, urllib, logging, json) —
# no other third-party deps. Upper bound is a guard against a surprise breaking release;
# bump it deliberately after testing a challenge solve.
nodriver>=0.39,<0.50

worker/skinland_worker.py Normal file

@@ -0,0 +1,174 @@
"""skin.land scrape worker (pull model).
A thin strategy over blworker.Worker, mirroring the cs.money worker — it supplies only the
skin.land-specific bits; the warm session, poll/scrape/post loop, IPRoyal proxy, IP
rotation, logging and shutdown all live in the shared runtime. Env knobs: worker/README.md.
How skin.land is scraped (learned from the discovery probes):
- A job's target is the market PAGE URL, e.g.
https://skin.land/market/csgo/ak-47-redline-field-tested/
- That Nuxt page embeds an internal numeric skin_id. We resolve it once from the page's
__NUXT__ payload (the skin object whose `url` == the page slug), cache it per slug, then
page the clean JSON API:
GET https://app.skin.land/api/v2/obtained-skins?skin_id={id}&page={n}
which returns a Laravel paginator {data:[...offers], meta:{current_page,last_page,…}}.
- We walk pages 1..last_page (capped by the job's maxPages), dedup offers by id, and post.
cd worker
.venv\\Scripts\\Activate.ps1
pip install -r requirements.txt
python skinland_worker.py
"""
import json
import re
from blworker import ScrapeResult, Worker, click, looks_like_challenge, page_fetch, run
# The offers API. skin_id is skin.land's internal id (resolved from the page); page is the
# Laravel paginator page. Same warm session, fetched in-page (CORS-allowed app subdomain).
API = "https://app.skin.land/api/v2/obtained-skins?skin_id={skin_id}&page={page}"
# The page's Nuxt payload is a devalue flat array; the main skin object is the one whose
# `url` field resolves to the page slug, and its `id` field resolves to the skin_id.
NUXT_ARRAY_RE = re.compile(r'\[\["(?:ShallowReactive|Reactive)",\d+\]')
def slug_of(url: str) -> str:
    return url.rstrip("/").rsplit("/", 1)[-1]


def extract_nuxt_array(html: str):
    """Pull the Nuxt devalue payload (a JSON flat array of values with index references)
    out of the page HTML. Returns the parsed list, or None."""
    m = NUXT_ARRAY_RE.search(html)
    if not m:
        return None
    start = m.start()
    depth = 0
    instr = False
    esc = False
    for i in range(start, len(html)):
        ch = html[i]
        if esc:
            esc = False
            continue
        if ch == "\\":
            esc = True
            continue
        if ch == '"':
            instr = not instr
            continue
        if instr:
            continue
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(html[start:i + 1])
                except json.JSONDecodeError:
                    return None
    return None


def resolve_skin_id(html: str, slug: str) -> int | None:
    """Find the page's main skin object in the Nuxt payload (the dict whose `url` field
    resolves to the page slug) and return its resolved `id` (skin.land's internal skin_id
    used by the obtained-skins API)."""
    arr = extract_nuxt_array(html)
    if not arr:
        return None

    def val(ref):
        return arr[ref] if isinstance(ref, int) and 0 <= ref < len(arr) else ref

    for el in arr:
        if isinstance(el, dict) and "url" in el and "id" in el and val(el["url"]) == slug:
            sid = val(el["id"])
            if isinstance(sid, int):
                return sid
    return None
class SkinLandWorker(Worker):
name = "skinland"
jobs_path = "/skinland/jobs"
default_market_url = "https://skin.land/market/csgo/"
def __init__(self, settings):
super().__init__(settings)
# skin_id is stable per skin+wear, so cache it per slug to skip the ~page fetch on
        # re-sweeps.
        self._skin_id_cache: dict[str, int] = {}

    def describe_job(self, job) -> str:
        return slug_of(job["url"])

    async def dismiss_consent(self, page) -> str | None:
        """Privacy-preserving: dismiss the cookie banner with essential-only if present."""
        for label in ("Accept essential", "ACCEPT ESSENTIAL", "Reject all"):
            if await click(page, label):
                return f"dismissed via {label!r}"
        return None

    async def _get_skin_id(self, page, job, slug: str) -> tuple[int | None, str, int]:
        """Resolve (and cache) skin.land's skin_id for this slug. Returns
        (skin_id, reason, wire); reason is "" on success, else a partial-stop reason."""
        if slug in self._skin_id_cache:
            return self._skin_id_cache[slug], "", 0
        _status, html, wire = await page_fetch(page, job["url"], accept="text/html")
        if looks_like_challenge(html):
            return None, "challenged", max(wire, 0)
        skin_id = resolve_skin_id(html, slug)
        if skin_id is None:
            return None, "no-skin-id", max(wire, 0)
        self._skin_id_cache[slug] = skin_id
        return skin_id, "", max(wire, 0)

    async def scrape_job(self, page, job) -> ScrapeResult:
        """Scrape ALL offers for one skin+wear by paging the obtained-skins API."""
        slug = slug_of(job["url"])
        max_pages = job.get("maxPages", 40)
        skin_id, reason, wire = await self._get_skin_id(page, job, slug)
        if skin_id is None:
            return ScrapeResult([], 0, reason, wire)
        seen: dict = {}
        fetches = 0
        page_n = 1
        reason = "completed"
        while page_n <= max_pages:
            _status, body, wbytes = await page_fetch(page, API.format(skin_id=skin_id, page=page_n))
            fetches += 1
            if wbytes > 0:
                wire += wbytes
            if looks_like_challenge(body):
                return ScrapeResult(list(seen.values()), fetches, "challenged", wire)
            try:
                payload = json.loads(body)
            except json.JSONDecodeError:
                return ScrapeResult(list(seen.values()), fetches, "bad-json", wire)
            for o in payload.get("data") or []:
                if o.get("id") is not None:
                    seen[o["id"]] = o
            meta = payload.get("meta") or {}
            last = meta.get("last_page")
            if not payload.get("data") or (isinstance(last, int) and page_n >= last):
                break  # walked the final page
            page_n += 1
            await self._pace(page)
        else:
            reason = "fetch-cap"
        return ScrapeResult(list(seen.values()), fetches, reason, wire)
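

# Illustrative sketch, not part of the original worker: the page walk from
# scrape_job replayed over canned payloads, to show the three ways it stops
# (empty page, meta.last_page reached, page cap hit). `walk_pages` and the
# payload dicts are hypothetical; only the "data" / "meta.last_page" keys come
# from the code above.
def walk_pages(payloads: dict[int, dict], max_pages: int = 40):
    seen: dict = {}
    page_n = 1
    while page_n <= max_pages:
        payload = payloads.get(page_n, {})
        for o in payload.get("data") or []:
            if o.get("id") is not None:
                seen[o["id"]] = o  # dedup by offer id across pages
        last = (payload.get("meta") or {}).get("last_page")
        if not payload.get("data") or (isinstance(last, int) and page_n >= last):
            return list(seen.values()), page_n, "completed"
        page_n += 1
    return list(seen.values()), max_pages, "fetch-cap"
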
if __name__ == "__main__":
    run(SkinLandWorker)


@@ -1,77 +0,0 @@
"""
One-off count verification: scrape a single skin+wear search from cs.money and
report how many distinct sell-orders come back, reusing the production worker's
warm-session + float-cursor pagination logic (worker.scrape_job).
Use it to sanity-check that our pagination actually recovers the FULL listing
count cs.money shows on the site (the known ground truth) for one query.
cd worker
.venv\\Scripts\\Activate.ps1
python verify_count.py "Desert Eagle Bronze Deco fn"
Env knobs (same meaning as worker.py): SOLVE_SECONDS, DELAY, JITTER, PROXY,
BROWSER_PATH, LOAD_IMAGES. MAX_FETCHES caps page fetches (default 80).
"""
import asyncio
import os
import sys
from collections import Counter
import nodriver as uc
import worker
MAX_FETCHES = int(os.environ.get("MAX_FETCHES", "80"))
async def main():
    search = " ".join(sys.argv[1:]) or "Desert Eagle Bronze Deco fn"
    args = [f"--proxy-server={worker.PROXY}"] if worker.PROXY else []
    if not worker.LOAD_IMAGES:
        args.append("--blink-settings=imagesEnabled=false")
    if os.environ.get("CHROME_NO_SANDBOX") == "1":
        args += ["--no-sandbox", "--disable-dev-shm-usage"]
    print(f"Verifying count for search {search!r} (proxy={worker.PROXY or 'own IP'})")
    browser = await uc.start(
        headless=False, browser_executable_path=worker.BROWSER_PATH, browser_args=args)
    try:
        page = await browser.get("about:blank")
        await worker.warm(page)
        job = {"search": search, "maxPages": MAX_FETCHES}
        # scrape_job returns (items, fetches, reason, wire_bytes); the byte count
        # is not reported here, so it is discarded.
        items, fetches, reason, _wire = await worker.scrape_job(page, job)
        print("\n=== result ===")
        print(f" search: {search}")
        print(f" stopped: {reason}")
        print(f" fetches: {fetches}")
        print(f" DISTINCT sell-orders (deduped by id): {len(items)}")
        # Break down what came back so we can see whether the count is inflated by
        # off-target names/wears (the C2's name+wear filter would drop those later).
        names = Counter()
        wears = Counter()
        st = 0
        for it in items:
            asset = it.get("asset") or {}
            names[(asset.get("names") or {}).get("full")] += 1
            wears[asset.get("quality")] += 1
            if asset.get("isStatTrak"):
                st += 1
        print(f" StatTrak in set: {st}")
        print(" by name:")
        for name, n in names.most_common():
            print(f" {n:4d} {name}")
        print(" by wear (quality code):")
        for w, n in wears.most_common():
            print(f" {n:4d} {w}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())


@@ -1,79 +0,0 @@
"""
Validate the float-cursor scrape by walking the float axis in BOTH directions and
comparing the recovered sell-order id sets. If ascending (lowest float first) and
descending (highest float first) independently land on the same listings, the
cursor is exhaustive and order-independent — i.e. the count is real, not an artifact
of walk direction or boundary double-counting.
python verify_crosscheck.py "Glock-18 Candy Apple mw"
"""
import asyncio
import sys
import nodriver as uc
import worker
CAP = worker.PAGE_CAP
ASC = ("https://cs.money/market/buy/?search={q}"
"&order=asc&sort=float&minFloat={cur:.12f}&maxFloat=1")
DESC = ("https://cs.money/market/buy/?search={q}"
"&order=desc&sort=float&minFloat=0&maxFloat={cur:.12f}")
async def walk(page, q, template, ascending, max_fetches=60):
    seen = {}
    cur = 0.0 if ascending else 1.0
    fetches = 0
    while fetches < max_fetches:
        # fetch_json returns (status, body, wire_bytes); only the body is used here.
        _status, body, _wire = await worker.fetch_json(page, template.format(q=q, cur=cur))
        fetches += 1
        if "Just a moment" in body or "challenge-platform" in body:
            return seen, fetches, "challenged"
        items = worker.extract_items(body)
        floats = []
        for it in items:
            if it.get("id") is not None:
                seen[it["id"]] = it
            fl = (it.get("asset") or {}).get("float")
            if isinstance(fl, (int, float)):
                floats.append(fl)
        if len(items) < CAP:
            return seen, fetches, "completed"
        nxt = (max(floats) if ascending else min(floats)) if floats else None
        if nxt is None or (ascending and nxt <= cur) or (not ascending and nxt >= cur):
            return seen, fetches, "stuck"
        cur = nxt
        await page.sleep(worker.DELAY)
    return seen, fetches, "fetch-cap"


async def main():
    search = " ".join(sys.argv[1:]) or "Glock-18 Candy Apple mw"
    q = worker.urllib.parse.quote_plus(search)
    browser = await uc.start(headless=False, browser_args=["--blink-settings=imagesEnabled=false"])
    try:
        page = await browser.get("about:blank")
        await worker.warm(page)
        asc, fa, ra = await walk(page, q, ASC, ascending=True)
        print(f"ASC : {len(asc):4d} ids {fa} fetches {ra}")
        desc, fd, rd = await walk(page, q, DESC, ascending=False)
        print(f"DESC: {len(desc):4d} ids {fd} fetches {rd}")
        a, d = set(asc), set(desc)
        union = a | d
        print("\n=== cross-check ===")
        print(f" ASC only: {len(a - d)}")
        print(f" DESC only: {len(d - a)}")
        print(f" in both: {len(a & d)}")
        print(f" UNION (distinct):{len(union)}")
        agree = "AGREE — count is solid" if a == d else "DISAGREE — one walk missed listings"
        print(f" verdict: {agree}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())


@@ -1,483 +0,0 @@
"""
cs.money scrape worker (pull model).
Holds ONE warm nodriver session (the thing that beats Cloudflare), then loops:
poll the .NET C2 for a job, scrape that skin+wear's sell-orders via in-page fetch
from the cleared session, and post the results back. The C2 owns job selection
(stalest skin+wear first) and persistence; this worker just fetches and forwards.
cd worker
.venv\\Scripts\\Activate.ps1
pip install -r requirements.txt
python worker.py
Env knobs:
C2_URL C2 base URL (default http://localhost:5080)
WORKER_TOKEN shared secret, must match the C2's WorkerToken (default dev-worker-token)
MARKET_URL market page to warm the session on (default the buy market)
SOLVE_SECONDS seconds to clear Cloudflare on startup (default 30)
DELAY / JITTER base + random seconds between page fetches (default 2.0 / 1.5)
IDLE_SECONDS sleep when the C2 has no work (default 10)
BROWSER_PATH path to Chrome/Edge if auto-detect fails
Proxy (pick one; IPRoyal takes priority when its creds are set):
IPROYAL_USERNAME IPRoyal residential account username
IPROYAL_PASSWORD IPRoyal residential account password
IPROYAL_COUNTRY ISO country for the exit (default us; blank = any)
IPROYAL_LIFETIME_MIN sticky-IP hold in minutes (default 60)
PROXY host:port for an auth-free proxy (fallback; omit to use your own IP)
Each worker process mints its own random IPRoyal sticky session at startup, so N
workers get N distinct residential exit IPs with no coordination — scale with
`docker compose up --scale worker=N`. On a Cloudflare challenge the worker rotates
to a fresh session (new IP) and re-warms. Chromium can't carry proxy credentials on
--proxy-server, so we run a tiny in-process forwarder (LocalForwardingProxy below)
that injects the IPRoyal auth and chains to the gateway; Chrome talks only to an
auth-free 127.0.0.1 endpoint, keeping us at zero CDP (a CDP auth handler is a
Cloudflare tell).
"""
import asyncio
import base64
import json
import os
import random
import re
import urllib.error
import urllib.parse
import urllib.request
import uuid
import nodriver as uc
C2_URL = os.environ.get("C2_URL", "http://localhost:5080").rstrip("/")
TOKEN = os.environ.get("WORKER_TOKEN", "dev-worker-token")
MARKET_URL = os.environ.get("MARKET_URL", "https://cs.money/market/buy/")
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
DELAY = float(os.environ.get("DELAY", "2.0"))
JITTER = float(os.environ.get("JITTER", "1.5"))
IDLE_SECONDS = int(os.environ.get("IDLE_SECONDS", "10"))
PROXY = os.environ.get("PROXY")
BROWSER_PATH = os.environ.get("BROWSER_PATH")
# IPRoyal residential gateway. One fixed host/port; country, sticky-session id and
# lifetime are encoded as underscore params appended to the password (see
# _iproyal_password). Mirrors the .NET IpRoyalProxyProvider scheme.
IPROYAL_HOST = os.environ.get("IPROYAL_HOST", "geo.iproyal.com")
IPROYAL_PORT = int(os.environ.get("IPROYAL_PORT", "12321"))
IPROYAL_USERNAME = os.environ.get("IPROYAL_USERNAME")
IPROYAL_PASSWORD = os.environ.get("IPROYAL_PASSWORD")
IPROYAL_COUNTRY = os.environ.get("IPROYAL_COUNTRY", "us").strip().lower()
IPROYAL_LIFETIME_MIN = int(os.environ.get("IPROYAL_LIFETIME_MIN", "60"))
# Residential proxy is metered per GB. Cloudflare gates on JS, not images, and the
# sell-orders API is pure JSON — so block images by default to slash page-render
# bandwidth. Set LOAD_IMAGES=1 to re-enable (e.g. for debugging the visible page).
LOAD_IMAGES = os.environ.get("LOAD_IMAGES") == "1"
# cs.money is an Astro SSR app: the free-text market search filters server-side and
# the resulting listings are embedded in the page as a __page-params JSON blob. The
# /2.0/market/sell-orders API rejects a `search` param (HTTP 400), so we fetch the
# PAGE for a search and read the embedded items — same item shape as the API.
#
# A page returns at most 60 and offset is ignored, so we paginate with a FORWARD
# CURSOR on float: cs.money honors `order=asc&sort=float` + `minFloat`, and float is
# full-precision and effectively unique per item. We grab the 60 lowest-float items
# at/above `lo`, advance `lo` to the highest float returned, and repeat until a page
# is under the cap. (The old minPrice/maxPrice bisection silently truncated cheap
# skins: >60 listings can share a sub-$0.02 reference band, which no price window can
# split — floats almost never tie, so the cursor always makes progress.)
PAGE = ("https://cs.money/market/buy/?search={search}"
"&order=asc&sort=float&minFloat={lo:.12f}&maxFloat=1")
PAGE_CAP = 60 # items per SSR page
PAGE_PARAMS_RE = re.compile(
r'<script\b[^>]*id="__page-params"[^>]*>(.*?)</script>', re.S)
# --- IPRoyal residential proxy ----------------------------------------------------
def _new_session_id() -> str:
    """Short, opaque, URL-safe token. IPRoyal pins one residential exit IP per
    distinct session value, so a fresh id == a fresh IP."""
    return uuid.uuid4().hex[:10]


def _iproyal_password(session_id: str) -> str:
    """Bake the targeting/session knobs onto the account password, IPRoyal-style:
    "<pass>_country-us_session-<id>_lifetime-60m". Country is optional."""
    pw = IPROYAL_PASSWORD
    if IPROYAL_COUNTRY:
        pw += f"_country-{IPROYAL_COUNTRY}"
    pw += f"_session-{session_id}_lifetime-{IPROYAL_LIFETIME_MIN}m"
    return pw
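

# Illustrative sketch, not part of the original worker: a parameterised twin of
# _iproyal_password that takes its inputs as arguments instead of module globals,
# to show the exact credential string the gateway would receive.
# `build_iproyal_password` and the sample values below are hypothetical.
def build_iproyal_password(base: str, session_id: str, country: str = "us",
                           lifetime_min: int = 60) -> str:
    pw = base
    if country:  # a blank country means "any exit", so the segment is omitted
        pw += f"_country-{country}"
    return pw + f"_session-{session_id}_lifetime-{lifetime_min}m"

# Under these assumptions, build_iproyal_password("secret", "ab12cd34ef") yields
# "secret_country-us_session-ab12cd34ef_lifetime-60m".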


class LocalForwardingProxy:
    """In-process HTTP proxy on 127.0.0.1 that chains every connection to the IPRoyal
    gateway, injecting the Proxy-Authorization header itself. Chromium ignores creds in
    --proxy-server and the in-browser ways to answer the gateway's 407 (a CDP auth
    handler, or a disabled MV2 extension) are Cloudflare tells — so we terminate the
    browser->proxy hop locally and add auth here, leaving Chrome to talk to an auth-free
    endpoint at zero CDP. HTTPS (all cs.money serves) flows through the CONNECT tunnel,
    so this proxy only relays ciphertext and never sees plaintext. Ported from the .NET
    LocalForwardingProxy. The active session token can be swapped live (set_password) to
    move to a fresh exit IP without restarting the browser. (New tunnels pick up the new
    IP; any still-open keep-alive tunnel stays on the old one until it closes.)"""

    def __init__(self, host: str, port: int, username: str, password: str):
        self._host = host
        self._port = port
        self._username = username
        self._password = password
        self._server: asyncio.AbstractServer | None = None
        self.endpoint = ""

    def set_password(self, password: str) -> None:
        self._password = password

    def _auth_header(self) -> str:
        token = base64.b64encode(f"{self._username}:{self._password}".encode()).decode()
        return f"Proxy-Authorization: Basic {token}\r\n"

    async def start(self) -> "LocalForwardingProxy":
        self._server = await asyncio.start_server(self._handle, "127.0.0.1", 0)
        port = self._server.sockets[0].getsockname()[1]
        self.endpoint = f"127.0.0.1:{port}"
        return self

    async def stop(self) -> None:
        if self._server is not None:
            self._server.close()
            try:
                await self._server.wait_closed()
            except Exception:
                pass

    @staticmethod
    async def _read_header(reader: asyncio.StreamReader) -> str | None:
        """Read up to the end of the HTTP header block (CRLFCRLF). None on EOF/overflow."""
        try:
            data = await reader.readuntil(b"\r\n\r\n")
        except (asyncio.IncompleteReadError, asyncio.LimitOverrunError):
            return None
        return data.decode("latin-1")

    async def _handle(self, client_reader: asyncio.StreamReader, client_writer: asyncio.StreamWriter) -> None:
        up_writer: asyncio.StreamWriter | None = None
        try:
            header = await self._read_header(client_reader)
            if not header:
                return
            parts = header.split("\r\n", 1)[0].split(" ")
            if len(parts) < 2:
                return
            method, target = parts[0], parts[1]
            up_reader, up_writer = await asyncio.open_connection(self._host, self._port)
            if method.upper() == "CONNECT":
                # HTTPS: open an authenticated tunnel upstream, then relay raw bytes.
                up_writer.write(
                    f"CONNECT {target} HTTP/1.1\r\nHost: {target}\r\n{self._auth_header()}\r\n".encode())
                await up_writer.drain()
                up_header = await self._read_header(up_reader)
                status = up_header.split(" ", 2) if up_header else []
                if len(status) < 2 or status[1] != "200":
                    line = (up_header or "no response").split("\r\n", 1)[0]
                    print(f" proxy: upstream refused CONNECT {target}: {line}")
                    client_writer.write(b"HTTP/1.1 502 Bad Gateway\r\nConnection: close\r\n\r\n")
                    await client_writer.drain()
                    return
                client_writer.write(b"HTTP/1.1 200 Connection established\r\n\r\n")
                await client_writer.drain()
            else:
                # Plain HTTP: re-inject the request upstream with auth, then relay.
                idx = header.index("\r\n") + 2
                up_writer.write((header[:idx] + self._auth_header() + header[idx:]).encode())
                await up_writer.drain()
            await self._relay(client_reader, client_writer, up_reader, up_writer)
        except Exception:
            pass  # one bad tunnel must never take down the listener
        finally:
            for w in (client_writer, up_writer):
                if w is not None:
                    try:
                        w.close()
                    except Exception:
                        pass

    @staticmethod
    async def _relay(
            client_reader: asyncio.StreamReader, client_writer: asyncio.StreamWriter,
            up_reader: asyncio.StreamReader, up_writer: asyncio.StreamWriter) -> None:
        # Pipe both directions, but tear the whole tunnel down as soon as EITHER side
        # closes (mirrors the .NET WhenAny). Waiting for both — as a plain gather does —
        # leaks a task holding two sockets on every half-closed connection, which piles
        # up fast across a long multi-worker run. Closing both writers when the first
        # pipe finishes unblocks the other's pending read so both tasks settle.
        async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
            try:
                while data := await reader.read(65536):
                    writer.write(data)
                    await writer.drain()
            except Exception:
                pass

        a = asyncio.create_task(pipe(client_reader, up_writer))
        b = asyncio.create_task(pipe(up_reader, client_writer))
        try:
            await asyncio.wait({a, b}, return_when=asyncio.FIRST_COMPLETED)
        finally:
            for w in (client_writer, up_writer):
                try:
                    w.close()
                except Exception:
                    pass
            await asyncio.gather(a, b, return_exceptions=True)


def looks_like_challenge(body: str) -> bool:
    s = (body or "").lstrip()
    return not s or s.startswith("<") or "Just a moment" in body or "challenge-platform" in body


# --- C2 HTTP (stdlib, run off the event loop) -------------------------------------
def _get_job_sync():
    req = urllib.request.Request(f"{C2_URL}/jobs/next", headers={"X-Worker-Token": TOKEN})
    try:
        with urllib.request.urlopen(req, timeout=15) as r:
            if r.status == 204:
                return None
            return json.loads(r.read() or b"null")
    except urllib.error.HTTPError as e:
        print(f" C2 /jobs/next -> HTTP {e.code}")
        return None
    except urllib.error.URLError as e:
        print(f" C2 unreachable: {e}")
        return None


def _post_result_sync(job_id: str, payload: dict):
    data = json.dumps(payload).encode()
    req = urllib.request.Request(
        f"{C2_URL}/jobs/{job_id}/result", data=data, method="POST",
        headers={"X-Worker-Token": TOKEN, "Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=60) as r:
            return json.loads(r.read() or b"null")
    except urllib.error.HTTPError as e:
        print(f" C2 result -> HTTP {e.code}: {e.read()[:200]!r}")
        return None
    except urllib.error.URLError as e:
        print(f" C2 unreachable posting result: {e}")
        return None


async def get_job():
    return await asyncio.to_thread(_get_job_sync)


async def post_result(job_id, payload):
    return await asyncio.to_thread(_post_result_sync, job_id, payload)


# --- scraping ---------------------------------------------------------------------
async def fetch_json(page, url: str) -> tuple[str, str, int]:
    """Fetch in-page and also read back the Resource Timing transferSize — the actual
    COMPRESSED bytes on the wire (what the metered proxy bills), not len(body) which is
    the decompressed size. Returns (status, body, wire_bytes); wire_bytes is -1 if the
    timing entry wasn't available. Same-origin (cs.money), so the size fields are exposed."""
    expr = (
        f"fetch({url!r}, {{credentials:'include', headers:{{'accept':'application/json'}}}})"
        f".then(async r => {{"
        f" const body = await r.text();"
        f" const e = performance.getEntriesByName({url!r}).slice(-1)[0];"
        f" return JSON.stringify({{status: r.status, body: body,"
        f" wire: e ? e.transferSize : -1, dec: e ? e.decodedBodySize : -1}});"
        f"}})"
    )
    raw = await page.evaluate(expr, await_promise=True)
    if not isinstance(raw, str):
        return ("-1", "", -1)
    try:
        obj = json.loads(raw)
        return (str(obj.get("status", "-1")), obj.get("body", ""), int(obj.get("wire", -1)))
    except (json.JSONDecodeError, ValueError, TypeError):
        return ("-1", raw, -1)


async def _click(page, text, timeout=3):
    try:
        el = await page.find(text, best_match=True, timeout=timeout)
        if el:
            await el.click()
            return True
    except Exception:
        pass
    return False


async def dismiss_consent(page):
    """Privacy-preserving. The banner only offers 'Accept all' / 'Manage cookies';
    the Reject-all control lives inside the Manage window. So: Manage -> Reject all ->
    Confirm. (The data path reads SSR __page-params regardless, but this keeps the
    session honest and unblocks any future interaction.)"""
    steps = []
    if await _click(page, "Manage cookies") or await _click(page, "Manage"):
        await page.sleep(1)
        if await _click(page, "Reject all"):
            steps.append("reject-all")
        for c in ("Confirm my choice", "Confirm", "Save"):
            if await _click(page, c):
                steps.append(f"confirm:{c}")
                break
    return ", ".join(steps) if steps else None


async def warm(page):
    """Open the market and clear Cloudflare so the session holds cf_clearance."""
    print(f"Warming session at {MARKET_URL} (clear Cloudflare; {SOLVE_SECONDS}s)...")
    await page.get(MARKET_URL)
    await page.sleep(SOLVE_SECONDS)
    clicked = await dismiss_consent(page)
    print(f"Consent: {'dismissed via ' + clicked if clicked else 'left up'}")


def extract_items(html: str) -> list:
    """Pull inventory.items out of the page's __page-params JSON blob."""
    m = PAGE_PARAMS_RE.search(html)
    if not m:
        return []
    try:
        return json.loads(m.group(1)).get("inventory", {}).get("items", []) or []
    except json.JSONDecodeError:
        return []
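

# Illustrative sketch, not part of the original worker: the same __page-params
# extraction applied to a tiny hand-written page, so the regex and the JSON path
# can be checked offline. `SAMPLE_HTML` and `items_from_html` are hypothetical;
# a real cs.money page carries far more fields per item.
import json
import re

_SKETCH_PARAMS_RE = re.compile(
    r'<script\b[^>]*id="__page-params"[^>]*>(.*?)</script>', re.S)

SAMPLE_HTML = (
    '<html><body><script type="application/json" id="__page-params">'
    '{"inventory": {"items": [{"id": 1, "asset": {"float": 0.0712}}]}}'
    '</script></body></html>'
)


def items_from_html(html: str) -> list:
    m = _SKETCH_PARAMS_RE.search(html)
    if not m:
        return []  # no embedded blob: treat as an empty page
    try:
        return json.loads(m.group(1)).get("inventory", {}).get("items", []) or []
    except json.JSONDecodeError:
        return []  # blob present but not valid JSON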


async def scrape_job(page, job) -> tuple[list, int, str, int]:
    """Scrape ALL listings for one skin+wear via a forward float cursor.
    A search page returns at most 60 items and ignores offset, but cs.money sorts by
    float (order=asc&sort=float) and filters by minFloat. So we walk the float axis:
    grab the 60 lowest-float items at/above `lo`, advance `lo` to the highest float on
    the page, and repeat until a page is under the cap. The boundary item is re-fetched
    (minFloat is inclusive) and dropped by the id dedup. Returns
    (items, fetches, reason, wire_bytes) where wire_bytes is the metered (compressed) cost.
    """
    search = urllib.parse.quote_plus(job["search"])
    max_fetches = job.get("maxPages", 40)  # safety cap on page fetches per job
    seen: dict = {}
    fetches = 0
    wire = 0
    lo = 0.0
    reason = "completed"
    while fetches < max_fetches:
        status, body, wbytes = await fetch_json(page, PAGE.format(search=search, lo=lo))
        fetches += 1
        if wbytes > 0:
            wire += wbytes
        if "Just a moment" in body or "challenge-platform" in body:
            return list(seen.values()), fetches, "challenged", wire
        items = extract_items(body)
        floats = []
        for it in items:
            if it.get("id") is not None:
                seen[it["id"]] = it
            fl = (it.get("asset") or {}).get("float")
            if isinstance(fl, (int, float)):
                floats.append(fl)
        if len(items) < PAGE_CAP:
            break  # last page — fewer than the cap means we've seen everything
        # Advance the cursor past the highest float on this page. Items at exactly that
        # float are re-fetched next round (minFloat is inclusive) and deduped by id.
        nxt = max(floats) if floats else None
        if nxt is None or nxt <= lo:
            # Cursor can't advance: >60 listings share a single float value, or the
            # items carry no float. Bail loudly rather than spin — a flagged gap beats
            # a silent one (this is the failure the price-window version hid).
            reason = "stuck-float-tie"
            break
        lo = nxt
        await page.sleep(DELAY + random.uniform(0, JITTER))
    else:
        reason = "fetch-cap"
    return list(seen.values()), fetches, reason, wire
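

# Illustrative sketch, not part of the original worker: the forward float cursor
# from scrape_job run against an in-memory list instead of cs.money, so the
# boundary re-fetch and the tie bail-out can be seen without a browser.
# `fake_page` and `walk_floats` are hypothetical names; the page cap and the
# inclusive lower bound mirror the assumptions stated in the docstring above.
def fake_page(listings: list[dict], lo: float, cap: int = 60) -> list[dict]:
    """Return the `cap` lowest-float listings with float >= lo (inclusive)."""
    hits = sorted((x for x in listings if x["float"] >= lo), key=lambda x: x["float"])
    return hits[:cap]


def walk_floats(listings: list[dict], cap: int = 60, max_fetches: int = 40):
    seen: dict = {}
    lo, fetches = 0.0, 0
    while fetches < max_fetches:
        items = fake_page(listings, lo, cap)
        fetches += 1
        for it in items:
            seen[it["id"]] = it  # the boundary item reappears; id dedup absorbs it
        if len(items) < cap:
            return list(seen.values()), fetches, "completed"
        nxt = max(it["float"] for it in items)
        if nxt <= lo:
            # A full page whose floats never exceed the cursor cannot be split.
            return list(seen.values()), fetches, "stuck-float-tie"
        lo = nxt
    return list(seen.values()), fetches, "fetch-cap"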


async def main():
    # IPRoyal (auth'd, per-worker sticky IP) takes priority; else a plain auth-free
    # PROXY; else this host's own IP. The forwarder injects IPRoyal auth so Chrome
    # only ever sees an auth-free 127.0.0.1 endpoint.
    forwarder = None
    session_id = None
    if IPROYAL_USERNAME and IPROYAL_PASSWORD:
        session_id = _new_session_id()
        forwarder = await LocalForwardingProxy(
            IPROYAL_HOST, IPROYAL_PORT, IPROYAL_USERNAME, _iproyal_password(session_id)).start()
        proxy = forwarder.endpoint
        proxy_label = f"iproyal[{IPROYAL_COUNTRY or 'any'}] session {session_id} via {forwarder.endpoint}"
    else:
        proxy = PROXY
        proxy_label = PROXY or "own IP"
    args = [f"--proxy-server={proxy}"] if proxy else []
    if not LOAD_IMAGES:
        # Disable image loading at the engine level — the dominant bandwidth cost on
        # an image-heavy market, and unneeded for CF clearance or the JSON API.
        args.append("--blink-settings=imagesEnabled=false")
    if os.environ.get("CHROME_NO_SANDBOX") == "1":
        # Required when running Chromium as root in a container.
        args += ["--no-sandbox", "--disable-dev-shm-usage"]
    print(f"Starting worker (C2={C2_URL}, proxy={proxy_label}, images={'on' if LOAD_IMAGES else 'off'})...")
    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH, browser_args=args)
    try:
        page = await browser.get("about:blank")
        await warm(page)
        total_wire = 0  # metered (compressed) bytes this worker has pulled, lifetime
        while True:
            job = await get_job()
            if not job:
                await asyncio.sleep(IDLE_SECONDS)
                continue
            print(f"Job {job['jobId'][:8]} — search {job['search']!r}")
            items, pages, reason, wire = await scrape_job(page, job)
            total_wire += wire
            if reason == "challenged":
                # The exit IP is likely flagged. On IPRoyal, rotate to a fresh sticky
                # session (new IP) before re-warming; otherwise just re-solve in place.
                if forwarder is not None:
                    session_id = _new_session_id()
                    forwarder.set_password(_iproyal_password(session_id))
                    print(f" challenged; rotating exit IP -> session {session_id}, re-warming...")
                else:
                    print(" re-challenged; re-warming session...")
                await warm(page)
            result = await post_result(job["jobId"], {
                "items": items, "pages": pages, "stoppedReason": reason})
            summary = (f"matched {result.get('matched')}, new {result.get('inserted')}, "
                       f"upd {result.get('updated')}, removed {result.get('removed')}") if result else "post failed"
            wire_kb = wire / 1024
            print(f" scraped {len(items)} items ({pages}p, {reason}, {wire_kb:.0f}KB wire) "
                  f"-> {summary} [lifetime {total_wire / 1_048_576:.1f}MB]")
            await page.sleep(DELAY + random.uniform(0, JITTER))
    finally:
        browser.stop()
        if forwarder is not None:
            await forwarder.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())