Add cs.money worker stack with per-worker IPRoyal residential proxy

Brings up the pull-model scraper: the .NET C2 hands skin+wear jobs to Python nodriver workers that scrape cs.money and post results back, plus the supporting Core/EFCore data model, migrations, and docker-compose orchestration.

IPRoyal proxying lets workers scale horizontally with a distinct residential exit IP each: every worker process mints its own sticky session at startup, and an in-process forwarding proxy injects the gateway auth so Chromium talks only to an auth-free localhost endpoint (zero CDP). On a Cloudflare challenge a worker rotates to a fresh session/IP and re-warms. Verified end-to-end against live IPRoyal: distinct US residential exits per worker and IP rotation on demand.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bob
2026-05-31 15:03:31 -05:00
parent eb5fb0dac7
commit dc7c3f99ae
82 changed files with 8354 additions and 571 deletions

3
worker/.gitattributes vendored Normal file

@@ -0,0 +1,3 @@
# entrypoint.sh runs in a Linux container — keep LF so the shebang isn't broken by
# Windows CRLF conversion.
*.sh text eol=lf

3
worker/.gitignore vendored Normal file

@@ -0,0 +1,3 @@
.venv/
__pycache__/
captures/

35
worker/Dockerfile Normal file

@@ -0,0 +1,35 @@
# cs.money worker: headful Chromium (nodriver) under a virtual display, with noVNC
# so you can open a browser into the container and solve a Cloudflare challenge by hand
# if one ever appears. Build context is the repo root (see docker-compose.yml).
FROM python:3.13-slim

# chromium + a virtual X display + VNC bridge + the fonts/libs Chromium needs.
RUN apt-get update && apt-get install -y --no-install-recommends \
        chromium \
        xvfb \
        x11vnc \
        novnc \
        websockify \
        ca-certificates \
        fonts-liberation \
        dumb-init \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY worker/requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY worker/worker.py worker/entrypoint.sh ./
RUN chmod +x entrypoint.sh

ENV BROWSER_PATH=/usr/bin/chromium \
    CHROME_NO_SANDBOX=1 \
    DISPLAY=:99 \
    SOLVE_SECONDS=45 \
    PYTHONUNBUFFERED=1

# noVNC web UI (browse http://localhost:6080/vnc.html to watch / solve a challenge).
EXPOSE 6080

# dumb-init reaps the Xvfb/x11vnc/websockify children cleanly.
ENTRYPOINT ["dumb-init", "--", "./entrypoint.sh"]

72
worker/README.md Normal file

@@ -0,0 +1,72 @@
# cs.money worker (Python)
The browser/Cloudflare layer for the cs.money scraper. .NET stays the **C2**
(orchestration, proxy/IP allocation, DB, the sweep loop); this worker is the only
component that drives a browser and defeats Cloudflare, because the effective
anti-bot tooling (`nodriver`/`undetected-chromedriver`, TLS impersonation) only
exists in Python/Go, not .NET.
## Why nodriver
.NET Selenium got insta-challenged by Cloudflare's managed challenge because
`msedgedriver` controls the browser via the DevTools protocol, leaving
`navigator.webdriver` and chromedriver `cdc_` artifacts that Cloudflare keys on.
`nodriver` drives a normal Chromium directly over CDP (no chromedriver) and patches
those tells, so it passes where Selenium loops.
## Step 1: prove it (current)
`poc.py` proves nodriver can clear cs.money's Cloudflare and fetch the listings API
before we build the full pull-based fleet.
```powershell
cd worker
py -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
python poc.py
```
A Chromium window opens on the market. Solve the Cloudflare check if shown; the
script waits, then pages `sell-orders` deeply (PAGES), reporting how far the warm
session survives before any re-challenge and confirming full float precision.
Output lands in `worker/captures/`.

**Targeted skin+wear search.** cs.money search is free-text on the page
(`?search=cyber+security+ft`). Set `SEARCH` and the PoC navigates there, **captures
the actual filtered `sell-orders` API request the page fires** (so we learn the real
filter params instead of guessing), prints it, then pages that filtered API:
```powershell
$env:SEARCH="cyber security ft"; python poc.py # FT M4A4 Cyber Security only
```
The `>>> DISCOVERED sell-orders API call` line shows how the search maps to API
params — that's how the C2 will build targeted jobs.
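
As a rough sketch of the mapping described above: the page search is a URL-encoded free-text query, and a captured API URL can be turned into a paging template by replacing its `offset`. The captured URL below is invented for illustration (in particular the `name` param is hypothetical; the real filter params are whatever the page fires). Only `?search=`, the `/2.0/market/sell-orders` path, and `limit`/`offset` paging come from this README and `poc.py`.

```python
from urllib.parse import parse_qsl, quote_plus, urlencode, urlsplit, urlunsplit

MARKET_URL = "https://cs.money/market/buy/"


def search_url(search: str) -> str:
    """Build the market page URL for a free-text skin+wear search."""
    return f"{MARKET_URL}?search={quote_plus(search)}"


def paging_template(captured: str, limit: int = 60) -> str:
    """Keep every captured param except offset, which becomes a '{}' slot."""
    parts = urlsplit(captured)
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k != "offset"]
    if not any(k == "limit" for k, _ in params):
        params.append(("limit", str(limit)))
    query = urlencode(params)
    query = (query + "&" if query else "") + "offset={}"
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


# Hypothetical captured request; the real param names may differ.
captured = "https://cs.money/2.0/market/sell-orders?limit=60&offset=0&name=cyber"
template = paging_template(captured)
print(search_url("cyber security ft"))
print(template.format(120))
```

Keeping every param except `offset` means the template carries whatever filter encoding the page used, without the C2 having to understand it.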
Run on your own IP first (no proxy) — that's the clean A/B vs. the Selenium run.
If auto-detect can't find a browser, set `BROWSER_PATH` to Chrome or Edge
(`C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe`).
## Step 2: the pull fleet
`worker.py` holds one warm nodriver session and loops: poll the .NET C2 for a job
(a skin+wear search), scrape that search's sell-orders via in-page fetch, and post
the items back. The C2 (`BlueLaminate.C2`) picks the stalest skin+wear from the
catalogue, and on result persists to `cs_money_listings` + `price_history`
(`Source = "csmoney"`), stamping `SkinCondition.ListingsSweptAt`.
Run the C2 (needs Postgres migrated), then the worker:
```powershell
# terminal 1 — the C2 (from repo root)
dotnet run --project BlueLaminate\BlueLaminate.C2 # serves http://localhost:5080
# terminal 2 — the worker
cd worker; .venv\Scripts\Activate.ps1
$env:WORKER_TOKEN="dev-worker-token" # must match the C2's WorkerToken
python worker.py
```
The worker warms the session (you clear Cloudflare once), then runs continuously.
Scale out by starting more workers (each with its own `PROXY`).
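
The scripts here expect `PROXY` to be an auth-free `host:port`, so a credentialed gateway needs something in between that adds the `Proxy-Authorization` header. A minimal sketch of the two pieces such a forwarder would need, assuming HTTP Basic auth: a per-worker session token and the header value. The credential layout (where the session id goes) is a placeholder and is not taken from IPRoyal's documentation.

```python
import base64
import secrets


def new_session_id(nbytes: int = 4) -> str:
    """Random per-worker token; minting a new one asks for a new sticky session."""
    return secrets.token_hex(nbytes)


def proxy_authorization(username: str, password: str) -> str:
    """Value of the Proxy-Authorization header for HTTP Basic auth."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Placeholder credential layout: how the session id is embedded is
# provider-specific and is an assumption here.
session = new_session_id()
header = proxy_authorization("user", f"pass_session-{session}")
print(header)
```

Rotating after a Cloudflare challenge then amounts to calling `new_session_id()` again and rebuilding the header.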

71
worker/diag_consent.py Normal file

@@ -0,0 +1,71 @@
"""
Diagnose the cs.money cookie-consent banner so we can dismiss it programmatically.
It's likely a Shadow DOM web component (CookieConsentSystem), which is why
document.querySelectorAll-based clicks miss the real buttons.

Saves:
    captures/_consent.png - screenshot (so we can SEE the banner + button positions)
    captures/_consent.txt - shadow-host tags + every consent-like button found by
                            piercing shadow roots, with center coordinates.

    cd worker; .venv\\Scripts\\Activate.ps1
    python diag_consent.py
"""
import json
import os
import pathlib

import nodriver as uc

URL = os.environ.get("URL", "https://cs.money/market/buy/?search=ak-47+redline")
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
BROWSER_PATH = os.environ.get("BROWSER_PATH")
OUT = pathlib.Path(__file__).parent / "captures"

# Pierce shadow roots to find consent buttons + their viewport-center coords.
DEEP_FIND = r"""
JSON.stringify((()=>{
  const hits=[], hosts=[];
  function walk(root){
    root.querySelectorAll('*').forEach(e=>{
      if(e.shadowRoot){ hosts.push(e.tagName.toLowerCase()); walk(e.shadowRoot); }
      const t=(e.textContent||'').trim();
      if(t.length<40 && /accept all|manage cookies|reject all|confirm my choice|^accept$|^manage$/i.test(t)){
        const r=e.getBoundingClientRect();
        if(r.width>0&&r.height>0)
          hits.push({tag:e.tagName, text:t, x:Math.round(r.x+r.width/2), y:Math.round(r.y+r.height/2)});
      }
    });
  }
  walk(document);
  return {shadowHosts:[...new Set(hosts)], buttons:hits};
})())
"""


async def main():
    OUT.mkdir(exist_ok=True)
    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH)
    try:
        page = await browser.get(URL)
        print(f"Loaded {URL}; waiting {SOLVE_SECONDS}s for Cloudflare...")
        await page.sleep(SOLVE_SECONDS)

        png = str(OUT / "_consent.png")
        await page.save_screenshot(png)
        print(f"screenshot -> {png}")

        raw = await page.evaluate(DEEP_FIND)
        info = json.loads(raw) if isinstance(raw, str) else {"error": repr(raw)}
        (OUT / "_consent.txt").write_text(json.dumps(info, indent=2), encoding="utf-8")
        print("shadow hosts:", info.get("shadowHosts"))
        print("consent buttons found:")
        for b in info.get("buttons", []):
            print(f"  {b}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())


@@ -0,0 +1,183 @@
"""
Discover how cs.money paginates a filtered search past the initial ~60 SSR items.

Tests two hypotheses against a high-result search (default "ak-47 redline", which has
well over 60 listings):

  A. Does the SSR page honor offset/limit in the URL? Fetch ?search=...&offset=60 and
     ?search=...&limit=120 and compare item ids to page 1. If disjoint/larger, we can
     paginate cheaply by re-fetching the page.
  B. The real client "load more": scroll hard to trigger lazy-load and capture any
     cs.money /2.0/ XHR via Resource Timing — that request carries the structured
     filter params + offset, i.e. a lighter direct-API pagination path.

Findings are printed and saved to captures/_pagination.txt.

    cd worker; .venv\\Scripts\\Activate.ps1
    python discover_pagination.py
    $env:SEARCH="ak-47 redline"; python discover_pagination.py   # override the search
"""
import json
import os
import pathlib
import re
from urllib.parse import quote_plus

import nodriver as uc
from nodriver import cdp

SEARCH = os.environ.get("SEARCH", "ak-47 redline")
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
BROWSER_PATH = os.environ.get("BROWSER_PATH")
PROXY = os.environ.get("PROXY")
BASE = "https://cs.money/market/buy/"
PAGE_PARAMS_RE = re.compile(r'<script\b[^>]*id="__page-params"[^>]*>(.*?)</script>', re.S)
OUT = pathlib.Path(__file__).parent / "captures"
CONSENT = ["Reject all", "Only necessary", "Reject", "Decline", "Deny"]

# Aggressive scroll: window + every scrollable container (the grid scrolls in a div,
# which is why a plain window.scrollTo didn't trigger lazy-load before).
SCROLL_JS = (
    "window.scrollTo(0, document.body.scrollHeight);"
    "document.querySelectorAll('*').forEach(e=>{"
    " if (e.scrollHeight > e.clientHeight + 80) e.scrollTop = e.scrollHeight;});")


async def js(page, expr):
    raw = await page.evaluate(f"JSON.stringify({expr})")
    try:
        return json.loads(raw) if isinstance(raw, str) else None
    except (json.JSONDecodeError, TypeError):
        return None


async def fetch_text(page, url):
    expr = (f"fetch({url!r},{{credentials:'include'}}).then(async r=>"
            f"JSON.stringify({{status:r.status, body:await r.text()}}))")
    raw = await page.evaluate(expr, await_promise=True)
    try:
        o = json.loads(raw)
        return o.get("status"), o.get("body", "")
    except (json.JSONDecodeError, TypeError):
        return None, ""


def page_item_ids(html):
    m = PAGE_PARAMS_RE.search(html or "")
    if not m:
        return []
    try:
        return [it.get("id") for it in json.loads(m.group(1)).get("inventory", {}).get("items", [])]
    except json.JSONDecodeError:
        return []


async def click_visible(page, pattern):
    """Click the first VISIBLE element whose trimmed text matches `pattern` (case-
    insensitive). nodriver's find() was matching hidden/duplicate nodes; restricting
    to offsetParent!=null + short text hits the real button."""
    expr = ("JSON.stringify((()=>{"
            "const re=new RegExp(" + json.dumps(pattern) + ",'i');"
            "const els=[...document.querySelectorAll('button,a,[role=\"button\"],span,div')];"
            "const b=els.find(e=>e.offsetParent!==null && (e.textContent||'').trim().length<40 "
            "&& re.test((e.textContent||'').trim()));"
            "if(b){b.click();return true}return false})())")
    r = await page.evaluate(expr)
    return isinstance(r, str) and "true" in r


async def banner_present(page):
    r = await page.evaluate(
        "JSON.stringify(/Manage cookies|Accept all/i.test(document.body.innerText||''))")
    return isinstance(r, str) and "true" in r


async def dismiss(page):
    """Privacy-preserving first (Manage -> Reject all -> Confirm); if the banner is
    still up, fall back to Accept all so the page becomes interactive (discovery
    needs scrolling to work)."""
    steps = []
    if await click_visible(page, "manage cookies|^manage$"):
        steps.append("manage")
        await page.sleep(1.2)
        if await click_visible(page, "reject all"):
            steps.append("reject-all")
            await page.sleep(0.4)
        for c in ("confirm my choice", "^confirm$", "^save$"):
            if await click_visible(page, c):
                steps.append("confirm")
                break
        await page.sleep(1)
    if await banner_present(page):
        steps.append("still-up->accept" if await click_visible(page, "accept all|^accept$") else "still-up")
        await page.sleep(0.5)
    steps.append("gone" if not await banner_present(page) else "STILL-PRESENT")
    return ", ".join(steps)


async def main():
    OUT.mkdir(exist_ok=True)
    args = [f"--proxy-server={PROXY}"] if PROXY else []
    args.append("--blink-settings=imagesEnabled=false")
    q = quote_plus(SEARCH)
    findings = []
    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH, browser_args=args)
    try:
        url0 = f"{BASE}?search={q}"
        page = await browser.get(url0)
        print(f"Warming on {url0} ({SOLVE_SECONDS}s for Cloudflare)...")
        await page.sleep(SOLVE_SECONDS)
        print(f"Consent: {await dismiss(page)}")

        # --- A. URL offset/limit on the SSR page ---
        _, h0 = await fetch_text(page, f"{BASE}?search={q}")
        _, h1 = await fetch_text(page, f"{BASE}?search={q}&offset=60")
        _, h2 = await fetch_text(page, f"{BASE}?search={q}&limit=120")
        a, b, c = page_item_ids(h0), page_item_ids(h1), page_item_ids(h2)
        overlap = len(set(a) & set(b))
        findings.append(f"page1 ids={len(a)} offset=60 ids={len(b)} (overlap with page1={overlap}) limit=120 ids={len(c)}")
        findings.append(f" -> offset works? {'YES (disjoint)' if b and overlap == 0 else 'no/ignored'}")
        findings.append(f" -> limit works? {'YES (>60)' if len(c) > 60 else 'no/ignored'}")

        # --- B. Trigger client load-more, capture cs.money /2.0/ XHRs ---
        # Infinite scroll only fires on GRADUAL downward scrolling — jumping to the
        # bottom skips the trigger. So step down in small wheel increments and watch
        # the item count grow.
        before = set(await js(page, "performance.getEntriesByType('resource').map(e=>e.name)") or [])

        async def card_count():
            n = await page.evaluate(
                "JSON.stringify(document.querySelectorAll('[href*=\"/item/\"],[class*=\"item\" i]').length)")
            return n

        print(f" cards before scroll: {await card_count()}")
        for step in range(60):
            try:
                await page.send(cdp.input_.dispatch_mouse_event(
                    type_="mouseWheel", x=720, y=450, delta_x=0, delta_y=500))
            except Exception:
                pass
            await page.sleep(0.7)
            if step % 15 == 14:
                now = [u for u in (await js(page, "performance.getEntriesByType('resource').map(e=>e.name)") or [])
                       if u not in before and "cs.money" in u and "metrics." not in u and "traces." not in u]
                print(f" step {step+1}: cards={await card_count()} new cs.money reqs={len(now)}")

        after = await js(page, "performance.getEntriesByType('resource').map(e=>e.name)") or []
        new_xhrs = [u for u in after if u not in before and "cs.money" in u
                    and "metrics." not in u and "traces." not in u]
        findings.append(f"\nclient requests after scrolling ({len(new_xhrs)} new cs.money):")
        findings.extend(f" {u}" for u in dict.fromkeys(new_xhrs))
        if not new_xhrs:
            findings.append(" (none — grid may not lazy-load via XHR, or scroll didn't reach the trigger)")

        report = "\n".join(findings)
        print("\n=== FINDINGS ===\n" + report)
        (OUT / "_pagination.txt").write_text(f"search: {SEARCH}\n\n{report}\n", encoding="utf-8")
        print(f"\nsaved to {OUT / '_pagination.txt'}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())


@@ -0,0 +1,96 @@
"""
Find cs.money's price-filter URL param (the basis for price-bucket pagination).

The market has a Price from/to filter in the sidebar. `search=` works via the URL and
the page SSRs the filtered listings into __page-params, so a price param likely works
the same way. We baseline the cheapest set, then try candidate param names with a high
floor and check whether the returned listings actually shift above it.

    cd worker; .venv\\Scripts\\Activate.ps1
    python discover_price_param.py
"""
import json
import os
import pathlib
import re
from urllib.parse import quote_plus

import nodriver as uc

SEARCH = os.environ.get("SEARCH", "ak-47 redline")
FLOOR = float(os.environ.get("FLOOR", "200"))
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
BROWSER_PATH = os.environ.get("BROWSER_PATH")
BASE = "https://cs.money/market/buy/"
PP = re.compile(r'<script\b[^>]*id="__page-params"[^>]*>(.*?)</script>', re.S)
OUT = pathlib.Path(__file__).parent / "captures"

# Param-name variants for a price floor (and a couple of from/to pairs).
# Not referenced by main(), which tests minPrice/maxPrice directly.
CANDIDATES = [
    "minPrice", "priceFrom", "price_from", "priceMin", "min_price",
    "priceGte", "from", "price_min", "minprice", "price.gte", "pricegte",
]


async def fetch_prices(page, url):
    expr = (f"fetch({url!r},{{credentials:'include'}}).then(async r=>"
            f"JSON.stringify({{status:r.status, body:await r.text()}}))")
    raw = await page.evaluate(expr, await_promise=True)
    try:
        body = json.loads(raw).get("body", "")
    except (json.JSONDecodeError, TypeError):
        return None
    m = PP.search(body or "")
    if not m:
        return None
    try:
        items = json.loads(m.group(1)).get("inventory", {}).get("items", [])
    except json.JSONDecodeError:
        return None
    return [it.get("pricing", {}) for it in items if it.get("pricing")]


async def main():
    OUT.mkdir(exist_ok=True)
    q = quote_plus(SEARCH)
    lines = []
    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH,
                             browser_args=["--blink-settings=imagesEnabled=false"])
    try:
        page = await browser.get(f"{BASE}?search={q}")
        print(f"Warming ({SOLVE_SECONDS}s)...")
        await page.sleep(SOLVE_SECONDS)

        # Test minPrice/maxPrice semantics directly (old cs.money API used these).
        tests = [
            ("baseline", f"{BASE}?search={q}"),
            ("maxPrice=200", f"{BASE}?search={q}&maxPrice=200"),
            ("minPrice=300", f"{BASE}?search={q}&minPrice=300"),
            ("minPrice=300&maxPrice=400", f"{BASE}?search={q}&minPrice=300&maxPrice=400"),
            ("minPrice=500&maxPrice=1000", f"{BASE}?search={q}&minPrice=500&maxPrice=1000"),
        ]

        def rng(pr, field):
            vals = [p.get(field) for p in pr if isinstance(p.get(field), (int, float))]
            return (min(vals), max(vals)) if vals else (None, None)

        def span(pr, field):
            # A pricing field can be absent from every item; formatting None with
            # :.2f would raise TypeError, so print a placeholder instead.
            lo, hi = rng(pr, field)
            return "[-]" if lo is None else f"[{lo:.2f},{hi:.2f}]"

        for name, url in tests:
            pr = await fetch_prices(page, url)
            if not pr:
                lines.append(f"{name:28} -> no items")
            else:
                lines.append(f"{name:28} -> n={len(pr)} default{span(pr, 'default')} "
                             f"computed{span(pr, 'computed')} base{span(pr, 'basePrice')}")
            print(lines[-1])

        (OUT / "_price_param.txt").write_text(
            f"search={SEARCH} floor={FLOOR}\n\n" + "\n".join(lines), encoding="utf-8")
        print(f"\nsaved to {OUT/'_price_param.txt'}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())

19
worker/entrypoint.sh Normal file

@@ -0,0 +1,19 @@
#!/usr/bin/env bash
# Start a virtual display, expose it over noVNC, then run the worker headful against it.
set -euo pipefail
DISPLAY_NUM="${DISPLAY:-:99}"
SCREEN="${SCREEN_GEOMETRY:-1440x900x24}"
echo "[entrypoint] starting Xvfb on ${DISPLAY_NUM} (${SCREEN})"
Xvfb "${DISPLAY_NUM}" -screen 0 "${SCREEN}" -nolisten tcp &
sleep 1
echo "[entrypoint] starting x11vnc (display ${DISPLAY_NUM} -> :5900)"
x11vnc -display "${DISPLAY_NUM}" -forever -shared -nopw -quiet -bg
echo "[entrypoint] starting noVNC on :6080 (open http://localhost:6080/vnc.html)"
websockify --web=/usr/share/novnc 6080 localhost:5900 &
echo "[entrypoint] launching worker"
exec python worker.py

285
worker/poc.py Normal file

@@ -0,0 +1,285 @@
"""
Proof-of-concept / pre-fleet validation for the cs.money scraper.

Proves the things we need before building the C2 + worker fleet:
  1. nodriver clears cs.money's Cloudflare where .NET Selenium couldn't.
  2. a single WARM session can page the sell-orders API deeply without re-challenge.
  3. a free-text market search (e.g. "cyber security ft") can be turned into a
     filtered sell-orders API call — we DISCOVER the real API params by capturing the
     request the page itself fires, instead of guessing.

It opens the market (optionally a search URL) in a real non-headless Chromium, lets
you clear Cloudflare, dismisses the cookie banner (privacy-preserving), captures the
sell-orders request the page makes, then pages that API from inside the cleared page
(same-origin fetch carries cf_clearance), pacing itself and stopping on re-challenge.

    cd worker
    .venv\\Scripts\\Activate.ps1
    pip install -r requirements.txt
    python poc.py                                    # whole-market sweep
    $env:SEARCH="cyber security ft"; python poc.py   # targeted: FT M4A4 Cyber Security

Env knobs (all optional):
  SEARCH          free-text market search; when set, scrape only those results
  MARKET_URL      market page base (default the buy market)
  SOLVE_SECONDS   seconds to wait for you to clear Cloudflare (default 30)
  PAGES           how many offset pages (60 each) to attempt (default 20)
  START_OFFSET    first offset (default 0)
  DELAY / JITTER  base + random seconds between fetches (default 2.0 / 1.5)
  PROXY           host:port for an auth-free proxy (omit to use your own IP)
  BROWSER_PATH    path to Chrome/Edge if auto-detect fails
"""
import json
import os
import pathlib
import random
from urllib.parse import quote_plus, urlsplit, parse_qsl, urlencode, urlunsplit

import nodriver as uc
from nodriver import cdp

SEARCH = os.environ.get("SEARCH")
MARKET_URL = os.environ.get("MARKET_URL", "https://cs.money/market/buy/")
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
PAGES = int(os.environ.get("PAGES", "20"))
START_OFFSET = int(os.environ.get("START_OFFSET", "0"))
DELAY = float(os.environ.get("DELAY", "2.0"))
JITTER = float(os.environ.get("JITTER", "1.5"))
PROXY = os.environ.get("PROXY")
BROWSER_PATH = os.environ.get("BROWSER_PATH")

# Fallback template if we fail to capture the page's own request (offset = {}).
DEFAULT_TEMPLATE = "https://cs.money/2.0/market/sell-orders?limit=60&offset={}"
OUT_DIR = pathlib.Path(__file__).parent / "captures"
CONSENT_LABELS = ["Reject all", "Reject All", "Only necessary", "Necessary only",
                  "Reject", "Decline", "Deny"]

# Filled by the CDP network handler with sell-orders request URLs the page fires.
_seen_urls: list[str] = []


def looks_like_challenge(body: str) -> bool:
    s = (body or "").lstrip()
    return not s or s.startswith("<") or "Just a moment" in body or "challenge-platform" in body


def decimals(v: float) -> int:
    r = repr(float(v))
    return len(r.split(".")[-1]) if "." in r else 0


def template_from(url: str) -> str:
    """Turn a captured sell-orders URL into a template with offset as '{}',
    preserving every other param (the search/filter encoding we want to learn)."""
    parts = urlsplit(url)
    q = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "offset"]
    if not any(k == "limit" for k, _ in q):
        q.append(("limit", "60"))
    base_q = urlencode(q)
    new_q = (base_q + "&" if base_q else "") + "offset={}"
    return urlunsplit((parts.scheme, parts.netloc, parts.path, new_q, ""))


async def dismiss_consent(page) -> str | None:
    """Best-effort, privacy-preserving — never clicks 'Accept all'."""
    for label in CONSENT_LABELS:
        try:
            el = await page.find(label, best_match=True, timeout=2)
        except Exception:
            el = None
        if el:
            try:
                await el.click()
                return label
            except Exception:
                pass
    return None


async def fetch_json(page, url: str) -> tuple[str, str]:
    expr = (
        f"fetch({url!r}, {{credentials:'include', headers:{{'accept':'application/json'}}}})"
        f".then(async r => JSON.stringify({{status: r.status, body: await r.text()}}))"
    )
    raw = await page.evaluate(expr, await_promise=True)
    if not isinstance(raw, str):
        return ("-1", "")
    try:
        obj = json.loads(raw)
        return (str(obj.get("status", "-1")), obj.get("body", ""))
    except json.JSONDecodeError:
        return ("-1", raw)


async def main():
    OUT_DIR.mkdir(exist_ok=True)
    args = [f"--proxy-server={PROXY}"] if PROXY else []
    target_url = MARKET_URL
    tag = "market"
    if SEARCH:
        sep = "&" if "?" in MARKET_URL else "?"
        target_url = f"{MARKET_URL}{sep}search={quote_plus(SEARCH)}"
        tag = "search_" + "".join(c if c.isalnum() else "_" for c in SEARCH)[:40]

    print(f"Launching nodriver Chromium (proxy={PROXY or 'none / own IP'})...")
    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH, browser_args=args)
    pages_ok = items_total = floats_total = low_prec = 0
    dp_min, dp_max = 99, 0
    deepest_offset = None
    reason = "completed (hit PAGES limit)"
    try:
        # Open a blank tab first so the network handler is attached BEFORE the page
        # fires its filtered sell-orders request (otherwise we'd miss it).
        page = await browser.get("about:blank")

        async def on_request(evt):
            url = evt.request.url
            if "/market/sell-orders" in url:
                _seen_urls.append(url)

        page.add_handler(cdp.network.RequestWillBeSent, on_request)
        try:
            await page.send(cdp.network.enable())
        except Exception as ex:
            print(f"(network capture unavailable: {ex})")

        print(f"Opening {target_url}")
        await page.get(target_url)
        print(f"Solve any Cloudflare challenge. Waiting {SOLVE_SECONDS}s for the grid...")
        await page.sleep(SOLVE_SECONDS)
        clicked = await dismiss_consent(page)
        print(f"Consent banner: {'dismissed via ' + clicked if clicked else 'left up (does not block fetch)'}")

        # Reliable discovery via the Resource Timing API: the browser records EVERY
        # request the page made, so we read the real sell-orders URL straight out of it
        # (no flaky CDP event timing). Also dump nearby API calls for context.
        # cs.money is an Astro SSR app — the initial filtered listings are rendered
        # server-side (no client XHR to capture). Scroll to provoke lazy-load
        # pagination, which DOES fire a client request carrying the real filter params.
        print("Scrolling to trigger lazy-load pagination...")
        for _ in range(6):
            try:
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            except Exception:
                pass
            await page.sleep(2)

        # nodriver returns arrays unreliably from evaluate(), so JSON.stringify in JS
        # and json.loads here (the string path is proven by fetch_json).
        async def js_list(expr: str) -> list:
            raw = await page.evaluate(f"JSON.stringify({expr})")
            try:
                return json.loads(raw) if isinstance(raw, str) else []
            except (json.JSONDecodeError, TypeError):
                return []

        try:
            all_urls = await js_list("performance.getEntriesByType('resource').map(e=>e.name)")
            print(f">>> Resource Timing saw {len(all_urls)} requests total")
            if all_urls:
                (OUT_DIR / "_all_requests.txt").write_text(
                    "\n".join(dict.fromkeys(all_urls)), encoding="utf-8")
            sell = [u for u in all_urls if "/market/sell-orders" in u]
            _seen_urls.extend(sell)
            api = [u for u in all_urls if "cs.money/" in u and ("/2.0/" in u or "/1.0/" in u)]
            if api:
                (OUT_DIR / "_api_calls.txt").write_text("\n".join(dict.fromkeys(api)), encoding="utf-8")
                print(f">>> {len(set(api))} cs.money API calls; saved to {OUT_DIR / '_api_calls.txt'}")
        except Exception as ex:
            print(f"(resource-timing query failed: {ex})")

        # Dump the SSR'd page so we can see how the filter is encoded and where the
        # listings data lives (Astro embeds island props / hydration JSON in the HTML).
        try:
            html = await page.evaluate("document.documentElement.outerHTML")
            if isinstance(html, str) and html:
                (OUT_DIR / "_page.html").write_text(html, encoding="utf-8")
                print(f">>> saved page HTML ({len(html)} bytes) to {OUT_DIR / '_page.html'}")
        except Exception as ex:
            print(f"(page HTML dump failed: {ex})")

        # Discovery: what sell-orders request did the page actually make?
        if _seen_urls:
            captured = _seen_urls[-1]
            template = template_from(captured)
            print("\n>>> DISCOVERED sell-orders API call the page fired:")
            print(f" {captured}")
            print(f">>> pagination template: {template}\n")
            # Persist it — the console line is easy to lose, and this is the one bit
            # of ground truth (the real filter-param scheme) we need.
            (OUT_DIR / "_discovered.txt").write_text(
                "ALL captured sell-orders requests:\n"
                + "\n".join(dict.fromkeys(_seen_urls))
                + f"\n\npagination template:\n{template}\n",
                encoding="utf-8")
            print(f">>> saved to {OUT_DIR / '_discovered.txt'}")
        else:
            template = DEFAULT_TEMPLATE
            if SEARCH:
                template = template.replace("offset={}", f"search={quote_plus(SEARCH)}&offset={{}}")
            print(f"\n(no request captured; falling back to template: {template})\n")

        for i in range(PAGES):
            offset = START_OFFSET + i * 60
            status, body = await fetch_json(page, template.format(offset))
            if looks_like_challenge(body):
                print(f" page {i + 1} [offset {offset}]: RE-CHALLENGED (status {status}). Stopping.")
                (OUT_DIR / f"{tag}_challenge_offset_{offset}.html").write_text(body, encoding="utf-8")
                reason = f"re-challenged at offset {offset}"
                break
            try:
                items = json.loads(body).get("items", [])
            except json.JSONDecodeError:
                print(f" page {i + 1} [offset {offset}]: non-JSON (status {status}). Stopping.")
                reason = f"non-JSON at offset {offset}"
                break
            if not items:
                print(f" page {i + 1} [offset {offset}]: 0 items — end of results.")
                reason = "end of results"
                break

            (OUT_DIR / f"{tag}_offset_{offset:06d}.json").write_text(body, encoding="utf-8")
            pages_ok += 1
            deepest_offset = offset
            items_total += len(items)
            names = set()
            for it in items:
                fl = it.get("asset", {}).get("float")
                if fl is not None:
                    floats_total += 1
                    d = decimals(fl)
                    dp_min, dp_max = min(dp_min, d), max(dp_max, d)
                    if d <= 6:  # short repr — exact binary fraction (e.g. 1/16), not truncation
                        low_prec += 1
                names.add(it.get("asset", {}).get("names", {}).get("full"))
            sample = next(iter(names), None) if SEARCH else None
            print(f" page {i + 1} [offset {offset}] OK — {len(items)} items"
                  + (f" (e.g. {sample}; {len(names)} distinct names)" if SEARCH else ""))
            await page.sleep(DELAY + random.uniform(0, JITTER))

        print("\n=== summary ===")
        print(f" query: {SEARCH or '(whole market)'}")
        print(f" stopped: {reason}")
        print(f" clean pages: {pages_ok} deepest offset: {deepest_offset} items: {items_total}")
        if floats_total:
            # Truncation would make MANY values short, not one exact binary fraction.
            verdict = "FULL precision" if low_prec / floats_total < 0.02 else "POSSIBLE TRUNCATION"
            print(f" floats: {floats_total} items, {dp_max}-decimal max, "
                  f"{low_prec} short-repr (exact fractions) — {verdict}")
        print(f" files in {OUT_DIR}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())

77
worker/probe_filters.py Normal file

@@ -0,0 +1,77 @@
"""
Probe which extra filter params cs.money's SSR market search honors, so we can
pick a SECOND pagination axis to break apart dense price bands that saturate the
60-cap (see diag_windows.py). For a saturating search we try candidate params and
report how the returned set's size + float range + price range change.

    python probe_filters.py "Glock-18 Candy Apple mw"
"""
import sys
from urllib.parse import quote_plus

import nodriver as uc

import worker

BASE = "https://cs.money/market/buy/?search={q}"

# (label, extra query string) — candidates cs.money markets commonly expose.
CANDIDATES = [
    ("baseline", ""),
    ("sort=price asc", "&order=asc&sort=price"),
    ("sort=price desc", "&order=desc&sort=price"),
    ("sort=float", "&sort=float"),
    ("minFloat/maxFloat lo", "&minFloat=0.07&maxFloat=0.10"),
    ("minFloat/maxFloat hi", "&minFloat=0.10&maxFloat=0.15"),
    ("maxWear lo", "&minWear=0.07&maxWear=0.10"),
    ("isStatTrak=true", "&isStatTrak=true"),
    ("hasStickers=false", "&hasStickers=false"),
]


def stats(items):
    floats = [(it.get("asset") or {}).get("float") for it in items]
    floats = [f for f in floats if isinstance(f, (int, float))]
    bases = []
    for it in items:
        p = it.get("pricing") or {}
        b = p.get("basePrice", p.get("computed"))
        if isinstance(b, (int, float)):
            bases.append(b)
    fr = f"[{min(floats):.4f},{max(floats):.4f}]" if floats else "[-]"
    br = f"[{min(bases):.2f},{max(bases):.2f}]" if bases else "[-]"
    return f"n={len(items):3d} float{fr} base{br}"


async def main():
    search = " ".join(sys.argv[1:]) or "Glock-18 Candy Apple mw"
    # Import quote_plus directly instead of reaching through worker's own imports.
    q = quote_plus(search)
    args = ["--blink-settings=imagesEnabled=false"]
    browser = await uc.start(headless=False, browser_args=args)
    try:
        page = await browser.get("about:blank")
        await worker.warm(page)
        base_ids = None
        for label, extra in CANDIDATES:
            url = BASE.format(q=q) + extra
            status, body = await worker.fetch_json(page, url)
            if "Just a moment" in body or "challenge-platform" in body:
                print(f" {label:24s} CHALLENGED")
                break
            items = worker.extract_items(body)
            ids = {it.get("id") for it in items}
            if label == "baseline":
                base_ids = ids
                delta = ""
            else:
                # If a param is IGNORED, the set is identical to baseline.
                delta = "IGNORED (== baseline)" if ids == base_ids else f"CHANGED ({len(ids ^ (base_ids or set()))} diff ids)"
            print(f" {label:24s} {stats(items)} {delta}")
            await page.sleep(worker.DELAY)
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())

5
worker/requirements.txt Normal file

@@ -0,0 +1,5 @@
# cs.money scraping worker.
# nodriver = the modern successor to undetected-chromedriver: it drives a normal
# Chromium over CDP directly (no chromedriver, so none of the cdc_/webdriver tells
# that got our .NET Selenium setup insta-challenged by Cloudflare).
nodriver>=0.39

77
worker/verify_count.py Normal file
View File

@@ -0,0 +1,77 @@
"""
One-off count verification: scrape a single skin+wear search from cs.money and
report how many distinct sell-orders come back, reusing the production worker's
warm-session + float-cursor pagination logic (worker.scrape_job).

Use it to sanity-check that our pagination actually recovers the FULL listing
count cs.money shows on the site (the known ground truth) for one query.

    cd worker
    .venv\\Scripts\\Activate.ps1
    python verify_count.py "Desert Eagle Bronze Deco fn"

Env knobs (same meaning as worker.py): SOLVE_SECONDS, DELAY, JITTER, PROXY,
BROWSER_PATH, LOAD_IMAGES. MAX_FETCHES caps page fetches (default 80).
"""
import asyncio
import os
import sys
from collections import Counter

import nodriver as uc

import worker

MAX_FETCHES = int(os.environ.get("MAX_FETCHES", "80"))


async def main():
    search = " ".join(sys.argv[1:]) or "Desert Eagle Bronze Deco fn"
    args = [f"--proxy-server={worker.PROXY}"] if worker.PROXY else []
    if not worker.LOAD_IMAGES:
        args.append("--blink-settings=imagesEnabled=false")
    if os.environ.get("CHROME_NO_SANDBOX") == "1":
        args += ["--no-sandbox", "--disable-dev-shm-usage"]
    print(f"Verifying count for search {search!r} (proxy={worker.PROXY or 'own IP'})")
    browser = await uc.start(
        headless=False, browser_executable_path=worker.BROWSER_PATH, browser_args=args)
    try:
        page = await browser.get("about:blank")
        await worker.warm(page)
        job = {"search": search, "maxPages": MAX_FETCHES}
        items, fetches, reason = await worker.scrape_job(page, job)
        print("\n=== result ===")
        print(f" search: {search}")
        print(f" stopped: {reason}")
        print(f" fetches: {fetches}")
        print(f" DISTINCT sell-orders (deduped by id): {len(items)}")
        # Break down what came back so we can see whether the count is inflated by
        # off-target names/wears (the C2's name+wear filter would drop those later).
        names = Counter()
        wears = Counter()
        st = 0
        for it in items:
            asset = it.get("asset") or {}
            names[(asset.get("names") or {}).get("full")] += 1
            wears[asset.get("quality")] += 1
            if asset.get("isStatTrak"):
                st += 1
        print(f" StatTrak in set: {st}")
        print(" by name:")
        for name, n in names.most_common():
            print(f" {n:4d} {name}")
        print(" by wear (quality code):")
        for w, n in wears.most_common():
            print(f" {n:4d} {w}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())

View File

@@ -0,0 +1,79 @@
"""
Validate the float-cursor scrape by walking the float axis in BOTH directions and
comparing the recovered sell-order id sets. If ascending (lowest float first) and
descending (highest float first) independently land on the same listings, the
cursor is exhaustive and order-independent, i.e. the count is real, not an artifact
of walk direction or boundary double-counting.

    python verify_crosscheck.py "Glock-18 Candy Apple mw"
"""
import asyncio
import sys

import nodriver as uc

import worker

CAP = worker.PAGE_CAP
ASC = ("https://cs.money/market/buy/?search={q}"
       "&order=asc&sort=float&minFloat={cur:.12f}&maxFloat=1")
DESC = ("https://cs.money/market/buy/?search={q}"
        "&order=desc&sort=float&minFloat=0&maxFloat={cur:.12f}")


async def walk(page, q, template, ascending, max_fetches=60):
    seen = {}
    cur = 0.0 if ascending else 1.0
    fetches = 0
    while fetches < max_fetches:
        status, body = await worker.fetch_json(page, template.format(q=q, cur=cur))
        fetches += 1
        if "Just a moment" in body or "challenge-platform" in body:
            return seen, fetches, "challenged"
        items = worker.extract_items(body)
        floats = []
        for it in items:
            if it.get("id") is not None:
                seen[it["id"]] = it
            fl = (it.get("asset") or {}).get("float")
            if isinstance(fl, (int, float)):
                floats.append(fl)
        if len(items) < CAP:
            return seen, fetches, "completed"
        nxt = (max(floats) if ascending else min(floats)) if floats else None
        if nxt is None or (ascending and nxt <= cur) or (not ascending and nxt >= cur):
            return seen, fetches, "stuck"
        cur = nxt
        await page.sleep(worker.DELAY)
    return seen, fetches, "fetch-cap"


async def main():
    search = " ".join(sys.argv[1:]) or "Glock-18 Candy Apple mw"
    q = worker.urllib.parse.quote_plus(search)
    browser = await uc.start(headless=False, browser_args=["--blink-settings=imagesEnabled=false"])
    try:
        page = await browser.get("about:blank")
        await worker.warm(page)
        asc, fa, ra = await walk(page, q, ASC, ascending=True)
        print(f"ASC : {len(asc):4d} ids {fa} fetches {ra}")
        desc, fd, rd = await walk(page, q, DESC, ascending=False)
        print(f"DESC: {len(desc):4d} ids {fd} fetches {rd}")
        a, d = set(asc), set(desc)
        union = a | d
        print("\n=== cross-check ===")
        print(f" ASC only: {len(a - d)}")
        print(f" DESC only: {len(d - a)}")
        print(f" in both: {len(a & d)}")
        print(f" UNION (distinct):{len(union)}")
        agree = "AGREE — count is solid" if a == d else "DISAGREE — one walk missed listings"
        print(f" verdict: {agree}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())
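
The cross-check idea can be exercised offline. The sketch below assumes a hypothetical in-memory `fake_page` that behaves the way the docstring describes the site (filter by a float range, sort by float, cap at 60); the synthetic listings and the seed are made up, and nothing here touches cs.money:

```python
import random

PAGE_CAP = 60  # same cap the scripts above assume per page


def fake_page(listings, lo, hi, ascending):
    """Hypothetical stand-in for one search page: inclusive float range, sorted, capped."""
    hits = [it for it in listings if lo <= it["float"] <= hi]
    hits.sort(key=lambda it: it["float"], reverse=not ascending)
    return hits[:PAGE_CAP]


def walk(listings, ascending, max_fetches=60):
    """Walk the float axis with an inclusive cursor, deduping by id."""
    seen = {}
    cur = 0.0 if ascending else 1.0
    for fetches in range(1, max_fetches + 1):
        lo, hi = (cur, 1.0) if ascending else (0.0, cur)
        page = fake_page(listings, lo, hi, ascending)
        for it in page:
            seen[it["id"]] = it
        if len(page) < PAGE_CAP:
            return seen, fetches, "completed"
        floats = [it["float"] for it in page]
        nxt = max(floats) if ascending else min(floats)
        if (ascending and nxt <= cur) or (not ascending and nxt >= cur):
            return seen, fetches, "stuck"  # a full page shares one float value
        cur = nxt
    return seen, max_fetches, "fetch-cap"


rng = random.Random(0)
listings = [{"id": i, "float": rng.random()} for i in range(250)]
asc, asc_fetches, asc_reason = walk(listings, ascending=True)
desc, desc_fetches, desc_reason = walk(listings, ascending=False)
print(len(asc), asc_reason, len(desc), desc_reason, set(asc) == set(desc))
```

With unique floats both directions should recover every id; giving 60 or more listings the same float is a quick way to see the "stuck" exit instead.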

453
worker/worker.py Normal file
View File

@@ -0,0 +1,453 @@
"""
cs.money scrape worker (pull model).

Holds ONE warm nodriver session (the thing that beats Cloudflare), then loops:
poll the .NET C2 for a job, scrape that skin+wear's sell-orders via in-page fetch
from the cleared session, and post the results back. The C2 owns job selection
(stalest skin+wear first) and persistence; this worker just fetches and forwards.

    cd worker
    .venv\\Scripts\\Activate.ps1
    pip install -r requirements.txt
    python worker.py

Env knobs:
    C2_URL             C2 base URL (default http://localhost:5080)
    WORKER_TOKEN       shared secret, must match the C2's WorkerToken (default dev-worker-token)
    MARKET_URL         market page to warm the session on (default the buy market)
    SOLVE_SECONDS      seconds to clear Cloudflare on startup (default 30)
    DELAY / JITTER     base + random seconds between page fetches (default 2.0 / 1.5)
    IDLE_SECONDS       sleep when the C2 has no work (default 10)
    BROWSER_PATH       path to Chrome/Edge if auto-detect fails
    LOAD_IMAGES        set to 1 to load images (default off, to save proxy bandwidth)
    CHROME_NO_SANDBOX  set to 1 when Chromium runs as root in a container

Proxy (pick one; IPRoyal takes priority when its creds are set):
    IPROYAL_USERNAME      IPRoyal residential account username
    IPROYAL_PASSWORD      IPRoyal residential account password
    IPROYAL_COUNTRY       ISO country for the exit (default us; blank = any)
    IPROYAL_LIFETIME_MIN  sticky-IP hold in minutes (default 60)
    IPROYAL_HOST / IPROYAL_PORT  gateway address (default geo.iproyal.com / 12321)
    PROXY                 host:port for an auth-free proxy (fallback; omit to use your own IP)

Each worker process mints its own random IPRoyal sticky session at startup, so N
workers get N distinct residential exit IPs with no coordination; scale with
`docker compose up --scale worker=N`. On a Cloudflare challenge the worker rotates
to a fresh session (new IP) and re-warms. Chromium can't carry proxy credentials on
--proxy-server, so we run a tiny in-process forwarder (LocalForwardingProxy below)
that injects the IPRoyal auth and chains to the gateway; Chrome talks only to an
auth-free 127.0.0.1 endpoint, so no CDP auth handler is needed (one would be a
Cloudflare tell).
"""
import asyncio
import base64
import json
import os
import random
import re
import urllib.error
import urllib.parse
import urllib.request
import uuid
import nodriver as uc
C2_URL = os.environ.get("C2_URL", "http://localhost:5080").rstrip("/")
TOKEN = os.environ.get("WORKER_TOKEN", "dev-worker-token")
MARKET_URL = os.environ.get("MARKET_URL", "https://cs.money/market/buy/")
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
DELAY = float(os.environ.get("DELAY", "2.0"))
JITTER = float(os.environ.get("JITTER", "1.5"))
IDLE_SECONDS = int(os.environ.get("IDLE_SECONDS", "10"))
PROXY = os.environ.get("PROXY")
BROWSER_PATH = os.environ.get("BROWSER_PATH")
# IPRoyal residential gateway. One fixed host/port; country, sticky-session id and
# lifetime are encoded as underscore params appended to the password (see
# _iproyal_password). Mirrors the .NET IpRoyalProxyProvider scheme.
IPROYAL_HOST = os.environ.get("IPROYAL_HOST", "geo.iproyal.com")
IPROYAL_PORT = int(os.environ.get("IPROYAL_PORT", "12321"))
IPROYAL_USERNAME = os.environ.get("IPROYAL_USERNAME")
IPROYAL_PASSWORD = os.environ.get("IPROYAL_PASSWORD")
IPROYAL_COUNTRY = os.environ.get("IPROYAL_COUNTRY", "us").strip().lower()
IPROYAL_LIFETIME_MIN = int(os.environ.get("IPROYAL_LIFETIME_MIN", "60"))
# Residential proxy is metered per GB. Cloudflare gates on JS, not images, and the
# sell-orders API is pure JSON — so block images by default to slash page-render
# bandwidth. Set LOAD_IMAGES=1 to re-enable (e.g. for debugging the visible page).
LOAD_IMAGES = os.environ.get("LOAD_IMAGES") == "1"
# cs.money is an Astro SSR app: the free-text market search filters server-side and
# the resulting listings are embedded in the page as a __page-params JSON blob. The
# /2.0/market/sell-orders API rejects a `search` param (HTTP 400), so we fetch the
# PAGE for a search and read the embedded items — same item shape as the API.
#
# A page returns at most 60 and offset is ignored, so we paginate with a FORWARD
# CURSOR on float: cs.money honors `order=asc&sort=float` + `minFloat`, and float is
# full-precision and effectively unique per item. We grab the 60 lowest-float items
# at/above `lo`, advance `lo` to the highest float returned, and repeat until a page
# is under the cap. (The old minPrice/maxPrice bisection silently truncated cheap
# skins: >60 listings can share a sub-$0.02 reference band, which no price window can
# split — floats almost never tie, so the cursor always makes progress.)
PAGE = ("https://cs.money/market/buy/?search={search}"
        "&order=asc&sort=float&minFloat={lo:.12f}&maxFloat=1")
PAGE_CAP = 60  # items per SSR page
PAGE_PARAMS_RE = re.compile(
    r'<script\b[^>]*id="__page-params"[^>]*>(.*?)</script>', re.S)


# --- IPRoyal residential proxy ----------------------------------------------------
def _new_session_id() -> str:
    """Short, opaque, URL-safe token. IPRoyal pins one residential exit IP per
    distinct session value, so a fresh id == a fresh IP."""
    return uuid.uuid4().hex[:10]


def _iproyal_password(session_id: str) -> str:
    """Bake the targeting/session knobs onto the account password, IPRoyal-style:
    "<pass>_country-us_session-<id>_lifetime-60m". Country is optional."""
    pw = IPROYAL_PASSWORD
    if IPROYAL_COUNTRY:
        pw += f"_country-{IPROYAL_COUNTRY}"
    pw += f"_session-{session_id}_lifetime-{IPROYAL_LIFETIME_MIN}m"
    return pw


class LocalForwardingProxy:
    """In-process HTTP proxy on 127.0.0.1 that chains every connection to the IPRoyal
    gateway, injecting the Proxy-Authorization header itself. Chromium ignores creds in
    --proxy-server and the in-browser ways to answer the gateway's 407 (a CDP auth
    handler, or a disabled MV2 extension) are Cloudflare tells, so we terminate the
    browser->proxy hop locally and add auth here, leaving Chrome to talk to an auth-free
    endpoint with no CDP auth handling. HTTPS (all cs.money serves) flows through the
    CONNECT tunnel, so this proxy only relays ciphertext and never sees plaintext.
    Ported from the .NET LocalForwardingProxy. The active session token can be swapped
    live (set_password) to move to a fresh exit IP without restarting the browser. (New
    tunnels pick up the new IP; any still-open keep-alive tunnel stays on the old one
    until it closes.)"""

    def __init__(self, host: str, port: int, username: str, password: str):
        self._host = host
        self._port = port
        self._username = username
        self._password = password
        self._server: asyncio.AbstractServer | None = None
        self.endpoint = ""

    def set_password(self, password: str) -> None:
        self._password = password

    def _auth_header(self) -> str:
        token = base64.b64encode(f"{self._username}:{self._password}".encode()).decode()
        return f"Proxy-Authorization: Basic {token}\r\n"

    async def start(self) -> "LocalForwardingProxy":
        # Port 0 lets the OS pick a free port; read it back for --proxy-server.
        self._server = await asyncio.start_server(self._handle, "127.0.0.1", 0)
        port = self._server.sockets[0].getsockname()[1]
        self.endpoint = f"127.0.0.1:{port}"
        return self

    async def stop(self) -> None:
        if self._server is not None:
            self._server.close()
            try:
                await self._server.wait_closed()
            except Exception:
                pass

    @staticmethod
    async def _read_header(reader: asyncio.StreamReader) -> str | None:
        """Read up to the end of the HTTP header block (CRLFCRLF). None on EOF/overflow."""
        try:
            data = await reader.readuntil(b"\r\n\r\n")
        except (asyncio.IncompleteReadError, asyncio.LimitOverrunError):
            return None
        return data.decode("latin-1")

    async def _handle(self, client_reader: asyncio.StreamReader, client_writer: asyncio.StreamWriter) -> None:
        up_writer: asyncio.StreamWriter | None = None
        try:
            header = await self._read_header(client_reader)
            if not header:
                return
            parts = header.split("\r\n", 1)[0].split(" ")
            if len(parts) < 2:
                return
            method, target = parts[0], parts[1]
            up_reader, up_writer = await asyncio.open_connection(self._host, self._port)
            if method.upper() == "CONNECT":
                # HTTPS: open an authenticated tunnel upstream, then relay raw bytes.
                up_writer.write(
                    f"CONNECT {target} HTTP/1.1\r\nHost: {target}\r\n{self._auth_header()}\r\n"
                    .encode("latin-1"))
                await up_writer.drain()
                up_header = await self._read_header(up_reader)
                status = up_header.split(" ", 2) if up_header else []
                if len(status) < 2 or status[1] != "200":
                    line = (up_header or "no response").split("\r\n", 1)[0]
                    print(f" proxy: upstream refused CONNECT {target}: {line}")
                    client_writer.write(b"HTTP/1.1 502 Bad Gateway\r\nConnection: close\r\n\r\n")
                    await client_writer.drain()
                    return
                client_writer.write(b"HTTP/1.1 200 Connection established\r\n\r\n")
                await client_writer.drain()
            else:
                # Plain HTTP: re-inject the request upstream with auth, then relay.
                # Encode as latin-1 to mirror the decode in _read_header, so any
                # non-ASCII header bytes round-trip unchanged.
                idx = header.index("\r\n") + 2
                up_writer.write(
                    (header[:idx] + self._auth_header() + header[idx:]).encode("latin-1"))
                await up_writer.drain()
            await self._relay(client_reader, client_writer, up_reader, up_writer)
        except Exception:
            pass  # one bad tunnel must never take down the listener
        finally:
            for w in (client_writer, up_writer):
                if w is not None:
                    try:
                        w.close()
                    except Exception:
                        pass

    @staticmethod
    async def _relay(
            client_reader: asyncio.StreamReader, client_writer: asyncio.StreamWriter,
            up_reader: asyncio.StreamReader, up_writer: asyncio.StreamWriter) -> None:
        async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
            try:
                while data := await reader.read(65536):
                    writer.write(data)
                    await writer.drain()
            except Exception:
                pass

        await asyncio.gather(
            pipe(client_reader, up_writer),
            pipe(up_reader, client_writer),
        )
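

def looks_like_challenge(body: str) -> bool:
    # Any HTML body starts with "<", so this heuristic only suits JSON endpoints;
    # the SSR search pages fetched by scrape_job would always match it.
    s = (body or "").lstrip()
    return not s or s.startswith("<") or "Just a moment" in body or "challenge-platform" in body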


# --- C2 HTTP (stdlib, run off the event loop) -------------------------------------
def _get_job_sync():
    req = urllib.request.Request(f"{C2_URL}/jobs/next", headers={"X-Worker-Token": TOKEN})
    try:
        with urllib.request.urlopen(req, timeout=15) as r:
            if r.status == 204:
                return None
            return json.loads(r.read() or b"null")
    except urllib.error.HTTPError as e:
        print(f" C2 /jobs/next -> HTTP {e.code}")
        return None
    except urllib.error.URLError as e:
        print(f" C2 unreachable: {e}")
        return None


def _post_result_sync(job_id: str, payload: dict):
    data = json.dumps(payload).encode()
    req = urllib.request.Request(
        f"{C2_URL}/jobs/{job_id}/result", data=data, method="POST",
        headers={"X-Worker-Token": TOKEN, "Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=60) as r:
            return json.loads(r.read() or b"null")
    except urllib.error.HTTPError as e:
        print(f" C2 result -> HTTP {e.code}: {e.read()[:200]!r}")
        return None
    except urllib.error.URLError as e:
        print(f" C2 unreachable posting result: {e}")
        return None


async def get_job():
    return await asyncio.to_thread(_get_job_sync)


async def post_result(job_id, payload):
    return await asyncio.to_thread(_post_result_sync, job_id, payload)


# --- scraping ---------------------------------------------------------------------
async def fetch_json(page, url: str) -> tuple[str, str]:
    expr = (
        f"fetch({url!r}, {{credentials:'include', headers:{{'accept':'application/json'}}}})"
        f".then(async r => JSON.stringify({{status: r.status, body: await r.text()}}))"
    )
    raw = await page.evaluate(expr, await_promise=True)
    if not isinstance(raw, str):
        return ("-1", "")
    try:
        obj = json.loads(raw)
        return (str(obj.get("status", "-1")), obj.get("body", ""))
    except json.JSONDecodeError:
        return ("-1", raw)


async def _click(page, text, timeout=3):
    try:
        el = await page.find(text, best_match=True, timeout=timeout)
        if el:
            await el.click()
            return True
    except Exception:
        pass
    return False


async def dismiss_consent(page):
    """Privacy-preserving. The banner only offers 'Accept all' / 'Manage cookies';
    the Reject-all control lives inside the Manage window. So: Manage -> Reject all ->
    Confirm. (The data path reads SSR __page-params regardless, but this keeps the
    session honest and unblocks any future interaction.)"""
    steps = []
    if await _click(page, "Manage cookies") or await _click(page, "Manage"):
        await page.sleep(1)
        if await _click(page, "Reject all"):
            steps.append("reject-all")
        for c in ("Confirm my choice", "Confirm", "Save"):
            if await _click(page, c):
                steps.append(f"confirm:{c}")
                break
    return ", ".join(steps) if steps else None


async def warm(page):
    """Open the market and clear Cloudflare so the session holds cf_clearance."""
    print(f"Warming session at {MARKET_URL} (clear Cloudflare; {SOLVE_SECONDS}s)...")
    await page.get(MARKET_URL)
    await page.sleep(SOLVE_SECONDS)
    clicked = await dismiss_consent(page)
    print(f"Consent: {'dismissed via ' + clicked if clicked else 'left up'}")


def extract_items(html: str) -> list:
    """Pull inventory.items out of the page's __page-params JSON blob."""
    m = PAGE_PARAMS_RE.search(html)
    if not m:
        return []
    try:
        return json.loads(m.group(1)).get("inventory", {}).get("items", []) or []
    except json.JSONDecodeError:
        return []


async def scrape_job(page, job) -> tuple[list, int, str]:
    """Scrape ALL listings for one skin+wear via a forward float cursor.

    A search page returns at most 60 items and ignores offset, but cs.money sorts by
    float (order=asc&sort=float) and filters by minFloat. So we walk the float axis:
    grab the 60 lowest-float items at/above `lo`, advance `lo` to the highest float on
    the page, and repeat until a page is under the cap. The boundary item is re-fetched
    (minFloat is inclusive) and dropped by the id dedup. Returns (items, fetches, reason).
    """
    search = urllib.parse.quote_plus(job["search"])
    max_fetches = job.get("maxPages", 40)  # safety cap on page fetches per job
    seen: dict = {}
    fetches = 0
    lo = 0.0
    reason = "completed"
    while fetches < max_fetches:
        status, body = await fetch_json(page, PAGE.format(search=search, lo=lo))
        fetches += 1
        if "Just a moment" in body or "challenge-platform" in body:
            return list(seen.values()), fetches, "challenged"
        items = extract_items(body)
        floats = []
        for it in items:
            if it.get("id") is not None:
                seen[it["id"]] = it
            fl = (it.get("asset") or {}).get("float")
            if isinstance(fl, (int, float)):
                floats.append(fl)
        if len(items) < PAGE_CAP:
            break  # last page: fewer than the cap means we've seen everything
        # Advance the cursor to the highest float on this page. Items at exactly that
        # float are re-fetched next round (minFloat is inclusive) and deduped by id.
        nxt = max(floats) if floats else None
        if nxt is None or nxt <= lo:
            # Cursor can't advance: a full page (60 or more listings) shares a single
            # float value, or the items carry no float. Bail loudly rather than spin;
            # a flagged gap beats a silent one (this is the failure the price-window
            # version hid).
            reason = "stuck-float-tie"
            break
        lo = nxt
        await page.sleep(DELAY + random.uniform(0, JITTER))
    else:
        # The while condition failed without a break: we hit the fetch cap.
        reason = "fetch-cap"
    return list(seen.values()), fetches, reason


async def main():
    # IPRoyal (auth'd, per-worker sticky IP) takes priority; else a plain auth-free
    # PROXY; else this host's own IP. The forwarder injects IPRoyal auth so Chrome
    # only ever sees an auth-free 127.0.0.1 endpoint.
    forwarder = None
    session_id = None
    if IPROYAL_USERNAME and IPROYAL_PASSWORD:
        session_id = _new_session_id()
        forwarder = await LocalForwardingProxy(
            IPROYAL_HOST, IPROYAL_PORT, IPROYAL_USERNAME, _iproyal_password(session_id)).start()
        proxy = forwarder.endpoint
        proxy_label = f"iproyal[{IPROYAL_COUNTRY or 'any'}] session {session_id} via {forwarder.endpoint}"
    else:
        proxy = PROXY
        proxy_label = PROXY or "own IP"
    args = [f"--proxy-server={proxy}"] if proxy else []
    if not LOAD_IMAGES:
        # Disable image loading at the engine level: the dominant bandwidth cost on
        # an image-heavy market, and unneeded for CF clearance or the JSON API.
        args.append("--blink-settings=imagesEnabled=false")
    if os.environ.get("CHROME_NO_SANDBOX") == "1":
        # Required when running Chromium as root in a container.
        args += ["--no-sandbox", "--disable-dev-shm-usage"]
    print(f"Starting worker (C2={C2_URL}, proxy={proxy_label}, images={'on' if LOAD_IMAGES else 'off'})...")
    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH, browser_args=args)
    try:
        page = await browser.get("about:blank")
        await warm(page)
        while True:
            job = await get_job()
            if not job:
                await asyncio.sleep(IDLE_SECONDS)
                continue
            print(f"Job {job['jobId'][:8]} — search {job['search']!r}")
            items, pages, reason = await scrape_job(page, job)
            if reason == "challenged":
                # The exit IP is likely flagged. On IPRoyal, rotate to a fresh sticky
                # session (new IP) before re-warming; otherwise just re-solve in place.
                if forwarder is not None:
                    session_id = _new_session_id()
                    forwarder.set_password(_iproyal_password(session_id))
                    print(f" challenged; rotating exit IP -> session {session_id}, re-warming...")
                else:
                    print(" re-challenged; re-warming session...")
                await warm(page)
            result = await post_result(job["jobId"], {
                "items": items, "pages": pages, "stoppedReason": reason})
            summary = (f"matched {result.get('matched')}, new {result.get('inserted')}, "
                       f"upd {result.get('updated')}, removed {result.get('removed')}") if result else "post failed"
            print(f" scraped {len(items)} items ({pages}p, {reason}) -> {summary}")
            await page.sleep(DELAY + random.uniform(0, JITTER))
    finally:
        browser.stop()
        if forwarder is not None:
            await forwarder.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())
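
The credential handling in worker.py can be checked without a network. This is a minimal sketch that restates the password suffix scheme and the header injection from `_iproyal_password`, `_auth_header` and the plain-HTTP branch of `_handle` as standalone functions; the function names, the `user`/`secret` credentials and the `abc123` session id are placeholders, not real values:

```python
import base64


def iproyal_password(password: str, session_id: str, country: str = "us", lifetime_min: int = 60) -> str:
    """Append the country/session/lifetime params to the account password."""
    pw = password
    if country:  # blank country means "any exit"
        pw += f"_country-{country}"
    return pw + f"_session-{session_id}_lifetime-{lifetime_min}m"


def proxy_auth_header(username: str, password: str) -> str:
    """Build the Basic Proxy-Authorization header line, CRLF-terminated."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Proxy-Authorization: Basic {token}\r\n"


def inject_auth(header_block: str, auth_line: str) -> str:
    """Insert the auth line right after the request line of an HTTP header block."""
    idx = header_block.index("\r\n") + 2
    return header_block[:idx] + auth_line + header_block[idx:]


pw = iproyal_password("secret", "abc123")
auth = proxy_auth_header("user", pw)
request = "GET http://example.com/ HTTP/1.1\r\nHost: example.com\r\n\r\n"
forwarded = inject_auth(request, auth)
print(pw)
print(forwarded.split("\r\n")[1].split(" ")[0])
```

Rotating to a new exit IP then amounts to calling the password builder again with a fresh session id, which is what `set_password` relies on in the worker.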