Add cs.money worker stack with per-worker IPRoyal residential proxy

Brings up the pull-model scraper: the .NET C2 hands skin+wear jobs to Python nodriver workers that scrape cs.money and post results back, plus the supporting Core/EFCore data model, migrations, and docker-compose orchestration.

IPRoyal proxying lets workers scale horizontally with a distinct residential exit IP each: every worker process mints its own sticky session at startup, and an in-process forwarding proxy injects the gateway auth so Chromium talks only to an auth-free localhost endpoint (zero CDP). On a Cloudflare challenge a worker rotates to a fresh session/IP and re-warms.

Verified end-to-end against live IPRoyal: distinct US residential exits per worker and IP rotation on demand.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
3
worker/.gitattributes
vendored
Normal file
@@ -0,0 +1,3 @@
# entrypoint.sh runs in a Linux container; keep LF so the shebang isn't broken by
# Windows CRLF conversion.
*.sh text eol=lf
3
worker/.gitignore
vendored
Normal file
@@ -0,0 +1,3 @@
.venv/
__pycache__/
captures/
35
worker/Dockerfile
Normal file
@@ -0,0 +1,35 @@
# cs.money worker: headful Chromium (nodriver) under a virtual display, with noVNC
# so you can open a browser into the container and solve a Cloudflare challenge by hand
# if one ever appears. Build context is the repo root (see docker-compose.yml).
FROM python:3.13-slim

# chromium + a virtual X display + VNC bridge + the fonts/libs Chromium needs.
RUN apt-get update && apt-get install -y --no-install-recommends \
        chromium \
        xvfb \
        x11vnc \
        novnc \
        websockify \
        ca-certificates \
        fonts-liberation \
        dumb-init \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY worker/requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY worker/worker.py worker/entrypoint.sh ./
RUN chmod +x entrypoint.sh

ENV BROWSER_PATH=/usr/bin/chromium \
    CHROME_NO_SANDBOX=1 \
    DISPLAY=:99 \
    SOLVE_SECONDS=45 \
    PYTHONUNBUFFERED=1

# noVNC web UI (browse http://localhost:6080/vnc.html to watch / solve a challenge).
EXPOSE 6080

# dumb-init reaps the Xvfb/x11vnc/websockify children cleanly.
ENTRYPOINT ["dumb-init", "--", "./entrypoint.sh"]
72
worker/README.md
Normal file
@@ -0,0 +1,72 @@
# cs.money worker (Python)

The browser/Cloudflare layer for the cs.money scraper. .NET stays the **C2**
(orchestration, proxy/IP allocation, DB, the sweep loop); this worker is the only
component that drives a browser and defeats Cloudflare, because the effective
anti-bot tooling (`nodriver`/`undetected-chromedriver`, TLS impersonation) only
exists in Python/Go, not .NET.

## Why nodriver

.NET Selenium was challenged immediately by Cloudflare's managed challenge because
Selenium drives the browser through a WebDriver binary (`msedgedriver`), which
leaves `navigator.webdriver` set and chromedriver `cdc_` artifacts that Cloudflare
keys on. `nodriver` drives a normal Chromium directly over CDP (no chromedriver) and
patches those tells, so it passes where Selenium loops on the challenge.

## Step 1: prove it (current)

`poc.py` proves nodriver can clear cs.money's Cloudflare and fetch the listings API
before we build the full pull-based fleet.

```powershell
cd worker
py -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
python poc.py
```

A Chromium window opens on the market. Solve the Cloudflare check if shown; the
script waits `SOLVE_SECONDS`, then pages `sell-orders` up to `PAGES` pages deep
(60 items per page), reporting how far the warm session survives before any
re-challenge and confirming full float precision.
Output lands in `worker/captures/`.
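
The paging behaviour described above can be sketched as a plain loop. This is a minimal illustration, not the PoC itself: `fetch` stands in for the in-page fetch (an assumption here, any callable that returns the response body as text), and `page_until_stopped` is a hypothetical helper name. The page size of 60 and the stop reasons mirror what `poc.py` reports.

```python
import json
from typing import Callable

PAGE_SIZE = 60  # items per sell-orders page, as used by the PoC


def looks_like_challenge(body: str) -> bool:
    """An empty or HTML body instead of JSON means Cloudflare stepped in again."""
    body = body or ""
    s = body.lstrip()
    return not s or s.startswith("<") or "Just a moment" in body or "challenge-platform" in body


def page_until_stopped(fetch: Callable[[str], str], template: str,
                       pages: int, start_offset: int = 0) -> tuple[int, str]:
    """Walk offsets 0, 60, 120, ... and return (pages fetched OK, stop reason)."""
    for i in range(pages):
        offset = start_offset + i * PAGE_SIZE
        body = fetch(template.format(offset))
        if looks_like_challenge(body):
            return i, f"re-challenged at offset {offset}"
        try:
            items = json.loads(body).get("items", [])
        except json.JSONDecodeError:
            return i, f"non-JSON at offset {offset}"
        if not items:
            return i, "end of results"
    return pages, "completed (hit PAGES limit)"
```

The returned page count is the "how far the warm session survives" figure: multiply it by 60 to get the number of listings reached before the stop reason applied.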

**Targeted skin+wear search.** cs.money search is free-text on the page
(`?search=cyber+security+ft`). Set `SEARCH` and the PoC navigates there, **captures
the actual filtered `sell-orders` API request the page fires** (so we learn the real
filter params instead of guessing), prints it, then pages that filtered API:

```powershell
$env:SEARCH="cyber security ft"; python poc.py   # FT M4A4 Cyber Security only
```

The `>>> DISCOVERED sell-orders API call` line shows how the search maps to API
params; that's how the C2 will build targeted jobs.
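
As a rough sketch of the two URL steps involved: building the search page URL from free text, and turning a captured `sell-orders` request into a template with the offset left open. The `exampleFilter` parameter used in the test values is purely hypothetical; the real filter params are exactly what the PoC is meant to discover.

```python
from urllib.parse import parse_qsl, quote_plus, urlencode, urlsplit, urlunsplit

MARKET_URL = "https://cs.money/market/buy/"


def search_url(search: str, base: str = MARKET_URL) -> str:
    """Free-text search -> market page URL (spaces become '+')."""
    sep = "&" if "?" in base else "?"
    return f"{base}{sep}search={quote_plus(search)}"


def template_from(url: str) -> str:
    """Captured sell-orders URL -> template with offset as '{}', other params kept."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "offset"]
    if not any(k == "limit" for k, _ in query):
        query.append(("limit", "60"))  # default page size when the capture has none
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query) + "&offset={}", ""))
```

Calling `template_from(captured).format(120)` then yields the URL for the third page of the same filtered search.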

Run on your own IP first (no proxy); that's the clean A/B vs. the Selenium run.
If auto-detect can't find a browser, set `BROWSER_PATH` to Chrome or Edge
(`C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe`).

## Step 2: the pull fleet

`worker.py` holds one warm nodriver session and loops: poll the .NET C2 for a job
(a skin+wear search), scrape that search's sell-orders via in-page fetch, and post
the items back. The C2 (`BlueLaminate.C2`) picks the stalest skin+wear from the
catalogue, and on result persists to `cs_money_listings` + `price_history`
(`Source = "csmoney"`), stamping `SkinCondition.ListingsSweptAt`.
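
The pull model can be sketched as a small loop. This is an illustration under stated assumptions, not `worker.py` itself: the job shape (`{"id", "search"}`), the three callables, and the idle delay are all hypothetical stand-ins for the C2 HTTP calls and the in-page scrape.

```python
import time
from typing import Callable, Optional


def run_worker(poll_job: Callable[[], Optional[dict]],
               scrape: Callable[[str], list],
               post_result: Callable[[int, list], None],
               max_jobs: Optional[int] = None,
               idle_seconds: float = 5.0,
               sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll for a job, scrape its search, post the items back; repeat.

    Returns the number of jobs completed (only reachable when max_jobs is set).
    """
    done = 0
    while max_jobs is None or done < max_jobs:
        job = poll_job()              # hypothetical: ask the C2 for the stalest skin+wear
        if job is None:
            sleep(idle_seconds)       # nothing to do right now; back off and poll again
            continue
        items = scrape(job["search"])  # hypothetical: page sell-orders in the warm session
        post_result(job["id"], items)
        done += 1
    return done
```

Because the worker pulls, the C2 never needs to know worker addresses; adding capacity is just starting another process that runs this loop.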

Run the C2 (needs Postgres migrated), then the worker:

```powershell
# terminal 1: the C2 (from repo root)
dotnet run --project BlueLaminate\BlueLaminate.C2   # serves http://localhost:5080

# terminal 2: the worker
cd worker; .venv\Scripts\Activate.ps1
$env:WORKER_TOKEN="dev-worker-token"                # must match the C2's WorkerToken
python worker.py
```

The worker warms the session (you clear Cloudflare once), then runs continuously.
Scale out by starting more workers (each with its own `PROXY`).
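
A minimal sketch of preparing one environment per worker, assuming `worker.py` reads `PROXY` and `WORKER_TOKEN` from the environment and passes the proxy to Chromium as `--proxy-server` the way `poc.py` does. The helper names and the example addresses in the usage note are hypothetical.

```python
from typing import Optional


def browser_args(proxy: Optional[str]) -> list[str]:
    """Chromium launch args for an auth-free host:port proxy (none = own IP)."""
    return [f"--proxy-server={proxy}"] if proxy else []


def worker_envs(proxies: list[str], token: str) -> list[dict[str, str]]:
    """One environment overlay per worker; refuses to share a proxy between workers."""
    if len(set(proxies)) != len(proxies):
        raise ValueError("each worker needs its own PROXY")
    return [{"WORKER_TOKEN": token, "PROXY": proxy} for proxy in proxies]
```

Each overlay can be merged over `os.environ` and handed to `subprocess.Popen(["python", "worker.py"], env=...)`, for example with `worker_envs(["127.0.0.1:8001", "127.0.0.1:8002"], "dev-worker-token")`.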
71
worker/diag_consent.py
Normal file
@@ -0,0 +1,71 @@
"""
Diagnose the cs.money cookie-consent banner so we can dismiss it programmatically.
It's likely a Shadow DOM web component (CookieConsentSystem), which is why
document.querySelectorAll-based clicks miss the real buttons.

Saves:
  captures/_consent.png  - screenshot (so we can SEE the banner + button positions)
  captures/_consent.txt  - shadow-host tags + every consent-like button found by
                           piercing shadow roots, with center coordinates.

    cd worker; .venv\\Scripts\\Activate.ps1
    python diag_consent.py
"""

import json
import os
import pathlib

import nodriver as uc

URL = os.environ.get("URL", "https://cs.money/market/buy/?search=ak-47+redline")
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
BROWSER_PATH = os.environ.get("BROWSER_PATH")
OUT = pathlib.Path(__file__).parent / "captures"

# Pierce shadow roots to find consent buttons + their viewport-center coords.
DEEP_FIND = r"""
JSON.stringify((()=>{
  const hits=[], hosts=[];
  function walk(root){
    root.querySelectorAll('*').forEach(e=>{
      if(e.shadowRoot){ hosts.push(e.tagName.toLowerCase()); walk(e.shadowRoot); }
      const t=(e.textContent||'').trim();
      if(t.length<40 && /accept all|manage cookies|reject all|confirm my choice|^accept$|^manage$/i.test(t)){
        const r=e.getBoundingClientRect();
        if(r.width>0&&r.height>0)
          hits.push({tag:e.tagName, text:t, x:Math.round(r.x+r.width/2), y:Math.round(r.y+r.height/2)});
      }
    });
  }
  walk(document);
  return {shadowHosts:[...new Set(hosts)], buttons:hits};
})())
"""


async def main():
    OUT.mkdir(exist_ok=True)
    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH)
    try:
        page = await browser.get(URL)
        print(f"Loaded {URL}; waiting {SOLVE_SECONDS}s for Cloudflare...")
        await page.sleep(SOLVE_SECONDS)

        png = str(OUT / "_consent.png")
        await page.save_screenshot(png)
        print(f"screenshot -> {png}")

        raw = await page.evaluate(DEEP_FIND)
        info = json.loads(raw) if isinstance(raw, str) else {"error": repr(raw)}
        (OUT / "_consent.txt").write_text(json.dumps(info, indent=2), encoding="utf-8")
        print("shadow hosts:", info.get("shadowHosts"))
        print("consent buttons found:")
        for b in info.get("buttons", []):
            print(f"  {b}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())
183
worker/discover_pagination.py
Normal file
@@ -0,0 +1,183 @@
"""
Discover how cs.money paginates a filtered search past the initial ~60 SSR items.

Tests two hypotheses against a high-result search (default "ak-47 redline", which has
well over 60 listings):

  A. Does the SSR page honor offset/limit in the URL? Fetch ?search=...&offset=60 and
     ?search=...&limit=120 and compare item ids to page 1. If disjoint/larger, we can
     paginate cheaply by re-fetching the page.
  B. The real client "load more": scroll hard to trigger lazy-load and capture any
     cs.money /2.0/ XHR via Resource Timing; that request carries the structured
     filter params + offset, i.e. a lighter direct-API pagination path.

Findings are printed and saved to captures/_pagination.txt.

    cd worker; .venv\\Scripts\\Activate.ps1
    python discover_pagination.py
    $env:SEARCH="ak-47 redline"; python discover_pagination.py   # override the search
"""

import json
import os
import pathlib
import re
from urllib.parse import quote_plus

import nodriver as uc
from nodriver import cdp

SEARCH = os.environ.get("SEARCH", "ak-47 redline")
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
BROWSER_PATH = os.environ.get("BROWSER_PATH")
PROXY = os.environ.get("PROXY")

BASE = "https://cs.money/market/buy/"
PAGE_PARAMS_RE = re.compile(r'<script\b[^>]*id="__page-params"[^>]*>(.*?)</script>', re.S)
OUT = pathlib.Path(__file__).parent / "captures"
CONSENT = ["Reject all", "Only necessary", "Reject", "Decline", "Deny"]

# Aggressive scroll: window + every scrollable container (the grid scrolls in a div,
# which is why a plain window.scrollTo didn't trigger lazy-load before).
SCROLL_JS = (
    "window.scrollTo(0, document.body.scrollHeight);"
    "document.querySelectorAll('*').forEach(e=>{"
    "  if (e.scrollHeight > e.clientHeight + 80) e.scrollTop = e.scrollHeight;});")


async def js(page, expr):
    raw = await page.evaluate(f"JSON.stringify({expr})")
    try:
        return json.loads(raw) if isinstance(raw, str) else None
    except (json.JSONDecodeError, TypeError):
        return None


async def fetch_text(page, url):
    expr = (f"fetch({url!r},{{credentials:'include'}}).then(async r=>"
            f"JSON.stringify({{status:r.status, body:await r.text()}}))")
    raw = await page.evaluate(expr, await_promise=True)
    try:
        o = json.loads(raw)
        return o.get("status"), o.get("body", "")
    except (json.JSONDecodeError, TypeError):
        return None, ""


def page_item_ids(html):
    m = PAGE_PARAMS_RE.search(html or "")
    if not m:
        return []
    try:
        return [it.get("id") for it in json.loads(m.group(1)).get("inventory", {}).get("items", [])]
    except json.JSONDecodeError:
        return []


async def click_visible(page, pattern):
    """Click the first VISIBLE element whose trimmed text matches `pattern` (case-
    insensitive). nodriver's find() was matching hidden/duplicate nodes; restricting
    to offsetParent!=null + short text hits the real button."""
    expr = ("JSON.stringify((()=>{"
            "const re=new RegExp(" + json.dumps(pattern) + ",'i');"
            "const els=[...document.querySelectorAll('button,a,[role=\"button\"],span,div')];"
            "const b=els.find(e=>e.offsetParent!==null && (e.textContent||'').trim().length<40 "
            "&& re.test((e.textContent||'').trim()));"
            "if(b){b.click();return true}return false})())")
    r = await page.evaluate(expr)
    return isinstance(r, str) and "true" in r


async def banner_present(page):
    r = await page.evaluate(
        "JSON.stringify(/Manage cookies|Accept all/i.test(document.body.innerText||''))")
    return isinstance(r, str) and "true" in r


async def dismiss(page):
    """Privacy-preserving first (Manage -> Reject all -> Confirm); if the banner is
    still up, fall back to Accept all so the page becomes interactive (discovery
    needs scrolling to work)."""
    steps = []
    if await click_visible(page, "manage cookies|^manage$"):
        steps.append("manage")
        await page.sleep(1.2)
    if await click_visible(page, "reject all"):
        steps.append("reject-all")
        await page.sleep(0.4)
    for c in ("confirm my choice", "^confirm$", "^save$"):
        if await click_visible(page, c):
            steps.append("confirm")
            break
    await page.sleep(1)
    if await banner_present(page):
        steps.append("still-up->accept" if await click_visible(page, "accept all|^accept$") else "still-up")
        await page.sleep(0.5)
    steps.append("gone" if not await banner_present(page) else "STILL-PRESENT")
    return ", ".join(steps)


async def main():
    OUT.mkdir(exist_ok=True)
    args = [f"--proxy-server={PROXY}"] if PROXY else []
    args.append("--blink-settings=imagesEnabled=false")
    q = quote_plus(SEARCH)
    findings = []

    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH, browser_args=args)
    try:
        url0 = f"{BASE}?search={q}"
        page = await browser.get(url0)
        print(f"Warming on {url0} ({SOLVE_SECONDS}s for Cloudflare)...")
        await page.sleep(SOLVE_SECONDS)
        print(f"Consent: {await dismiss(page)}")

        # --- A. URL offset/limit on the SSR page ---
        _, h0 = await fetch_text(page, f"{BASE}?search={q}")
        _, h1 = await fetch_text(page, f"{BASE}?search={q}&offset=60")
        _, h2 = await fetch_text(page, f"{BASE}?search={q}&limit=120")
        a, b, c = page_item_ids(h0), page_item_ids(h1), page_item_ids(h2)
        overlap = len(set(a) & set(b))
        findings.append(f"page1 ids={len(a)}  offset=60 ids={len(b)} (overlap with page1={overlap})  limit=120 ids={len(c)}")
        findings.append(f"  -> offset works? {'YES (disjoint)' if b and overlap == 0 else 'no/ignored'}")
        findings.append(f"  -> limit works?  {'YES (>60)' if len(c) > 60 else 'no/ignored'}")

        # --- B. Trigger client load-more, capture cs.money /2.0/ XHRs ---
        # Infinite scroll only fires on GRADUAL downward scrolling; jumping to the
        # bottom skips the trigger. So step down in small wheel increments and watch
        # the item count grow.
        before = set(await js(page, "performance.getEntriesByType('resource').map(e=>e.name)") or [])

        async def card_count():
            n = await page.evaluate(
                "JSON.stringify(document.querySelectorAll('[href*=\"/item/\"],[class*=\"item\" i]').length)")
            return n

        print(f"  cards before scroll: {await card_count()}")
        for step in range(60):
            try:
                await page.send(cdp.input_.dispatch_mouse_event(
                    type_="mouseWheel", x=720, y=450, delta_x=0, delta_y=500))
            except Exception:
                pass
            await page.sleep(0.7)
            if step % 15 == 14:
                now = [u for u in (await js(page, "performance.getEntriesByType('resource').map(e=>e.name)") or [])
                       if u not in before and "cs.money" in u and "metrics." not in u and "traces." not in u]
                print(f"  step {step+1}: cards={await card_count()} new cs.money reqs={len(now)}")
        after = await js(page, "performance.getEntriesByType('resource').map(e=>e.name)") or []
        new_xhrs = [u for u in after if u not in before and "cs.money" in u
                    and "metrics." not in u and "traces." not in u]
        findings.append(f"\nclient requests after scrolling ({len(new_xhrs)} new cs.money):")
        findings.extend(f"  {u}" for u in dict.fromkeys(new_xhrs))
        if not new_xhrs:
            findings.append("  (none: grid may not lazy-load via XHR, or scroll didn't reach the trigger)")

        report = "\n".join(findings)
        print("\n=== FINDINGS ===\n" + report)
        (OUT / "_pagination.txt").write_text(f"search: {SEARCH}\n\n{report}\n", encoding="utf-8")
        print(f"\nsaved to {OUT / '_pagination.txt'}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())
96
worker/discover_price_param.py
Normal file
@@ -0,0 +1,96 @@
"""
Find cs.money's price-filter URL param (the basis for price-bucket pagination).

The market has a Price from/to filter in the sidebar. `search=` works via the URL and
the page SSRs the filtered listings into __page-params, so a price param likely works
the same way. We baseline the cheapest set, then try candidate param names with a high
floor and check whether the returned listings actually shift above it.

    cd worker; .venv\\Scripts\\Activate.ps1
    python discover_price_param.py
"""

import json
import os
import pathlib
import re
from urllib.parse import quote_plus

import nodriver as uc

SEARCH = os.environ.get("SEARCH", "ak-47 redline")
FLOOR = float(os.environ.get("FLOOR", "200"))
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
BROWSER_PATH = os.environ.get("BROWSER_PATH")
BASE = "https://cs.money/market/buy/"
PP = re.compile(r'<script\b[^>]*id="__page-params"[^>]*>(.*?)</script>', re.S)
OUT = pathlib.Path(__file__).parent / "captures"

# Param-name variants for a price floor (and a couple of from/to pairs).
CANDIDATES = [
    "minPrice", "priceFrom", "price_from", "priceMin", "min_price",
    "priceGte", "from", "price_min", "minprice", "price.gte", "pricegte",
]


async def fetch_prices(page, url):
    """Fetch `url` from inside the page and return each listing's `pricing` dict."""
    expr = (f"fetch({url!r},{{credentials:'include'}}).then(async r=>"
            f"JSON.stringify({{status:r.status, body:await r.text()}}))")
    raw = await page.evaluate(expr, await_promise=True)
    try:
        body = json.loads(raw).get("body", "")
    except (json.JSONDecodeError, TypeError):
        return None
    m = PP.search(body or "")
    if not m:
        return None
    try:
        items = json.loads(m.group(1)).get("inventory", {}).get("items", [])
    except json.JSONDecodeError:
        return None
    return [it.get("pricing", {}) for it in items if it.get("pricing")]


async def main():
    OUT.mkdir(exist_ok=True)
    q = quote_plus(SEARCH)
    lines = []
    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH,
                             browser_args=["--blink-settings=imagesEnabled=false"])
    try:
        page = await browser.get(f"{BASE}?search={q}")
        print(f"Warming ({SOLVE_SECONDS}s)...")
        await page.sleep(SOLVE_SECONDS)

        # Test minPrice/maxPrice semantics directly (old cs.money API used these).
        tests = [
            ("baseline", f"{BASE}?search={q}"),
            ("maxPrice=200", f"{BASE}?search={q}&maxPrice=200"),
            ("minPrice=300", f"{BASE}?search={q}&minPrice=300"),
            ("minPrice=300&maxPrice=400", f"{BASE}?search={q}&minPrice=300&maxPrice=400"),
            ("minPrice=500&maxPrice=1000", f"{BASE}?search={q}&minPrice=500&maxPrice=1000"),
        ]

        def rng(pr, field):
            vals = [p.get(field) for p in pr if isinstance(p.get(field), (int, float))]
            return (min(vals), max(vals)) if vals else (None, None)

        def span(pr, field):
            # Format the [min,max] of a pricing field. A field with no numeric values
            # prints as [n/a] instead of crashing the ':.2f' format on None.
            lo, hi = rng(pr, field)
            return "[n/a]" if lo is None else f"[{lo:.2f},{hi:.2f}]"

        for name, url in tests:
            pr = await fetch_prices(page, url)
            if not pr:
                lines.append(f"{name:28} -> no items")
            else:
                lines.append(f"{name:28} -> n={len(pr)}  default{span(pr, 'default')}  "
                             f"computed{span(pr, 'computed')}  base{span(pr, 'basePrice')}")
            print(lines[-1])

        (OUT / "_price_param.txt").write_text(
            f"search={SEARCH} floor={FLOOR}\n\n" + "\n".join(lines), encoding="utf-8")
        print(f"\nsaved to {OUT/'_price_param.txt'}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())
19
worker/entrypoint.sh
Normal file
@@ -0,0 +1,19 @@
#!/usr/bin/env bash
# Start a virtual display, expose it over noVNC, then run the worker headful against it.
set -euo pipefail

DISPLAY_NUM="${DISPLAY:-:99}"
SCREEN="${SCREEN_GEOMETRY:-1440x900x24}"

echo "[entrypoint] starting Xvfb on ${DISPLAY_NUM} (${SCREEN})"
Xvfb "${DISPLAY_NUM}" -screen 0 "${SCREEN}" -nolisten tcp &
sleep 1

echo "[entrypoint] starting x11vnc (display ${DISPLAY_NUM} -> :5900)"
x11vnc -display "${DISPLAY_NUM}" -forever -shared -nopw -quiet -bg

echo "[entrypoint] starting noVNC on :6080 (open http://localhost:6080/vnc.html)"
websockify --web=/usr/share/novnc 6080 localhost:5900 &

echo "[entrypoint] launching worker"
exec python worker.py
285
worker/poc.py
Normal file
@@ -0,0 +1,285 @@
"""
Proof-of-concept / pre-fleet validation for the cs.money scraper.

Proves the things we need before building the C2 + worker fleet:
  1. nodriver clears cs.money's Cloudflare where .NET Selenium couldn't.
  2. a single WARM session can page the sell-orders API deeply without re-challenge.
  3. a free-text market search (e.g. "cyber security ft") can be turned into a
     filtered sell-orders API call: we DISCOVER the real API params by capturing the
     request the page itself fires, instead of guessing.

It opens the market (optionally a search URL) in a real non-headless Chromium, lets
you clear Cloudflare, dismisses the cookie banner (privacy-preserving), captures the
sell-orders request the page makes, then pages that API from inside the cleared page
(same-origin fetch carries cf_clearance), pacing itself and stopping on re-challenge.

    cd worker
    .venv\\Scripts\\Activate.ps1
    pip install -r requirements.txt

    python poc.py                                    # whole-market sweep
    $env:SEARCH="cyber security ft"; python poc.py   # targeted: FT M4A4 Cyber Security

Env knobs (all optional):
  SEARCH          free-text market search; when set, scrape only those results
  MARKET_URL      market page base (default the buy market)
  SOLVE_SECONDS   seconds to wait for you to clear Cloudflare (default 30)
  PAGES           how many offset pages (60 each) to attempt (default 20)
  START_OFFSET    first offset (default 0)
  DELAY / JITTER  base + random seconds between fetches (default 2.0 / 1.5)
  PROXY           host:port for an auth-free proxy (omit to use your own IP)
  BROWSER_PATH    path to Chrome/Edge if auto-detect fails
"""

import json
import os
import pathlib
import random
from urllib.parse import quote_plus, urlsplit, parse_qsl, urlencode, urlunsplit

import nodriver as uc
from nodriver import cdp

SEARCH = os.environ.get("SEARCH")
MARKET_URL = os.environ.get("MARKET_URL", "https://cs.money/market/buy/")
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
PAGES = int(os.environ.get("PAGES", "20"))
START_OFFSET = int(os.environ.get("START_OFFSET", "0"))
DELAY = float(os.environ.get("DELAY", "2.0"))
JITTER = float(os.environ.get("JITTER", "1.5"))
PROXY = os.environ.get("PROXY")
BROWSER_PATH = os.environ.get("BROWSER_PATH")

# Fallback template if we fail to capture the page's own request (offset = {}).
DEFAULT_TEMPLATE = "https://cs.money/2.0/market/sell-orders?limit=60&offset={}"
OUT_DIR = pathlib.Path(__file__).parent / "captures"
CONSENT_LABELS = ["Reject all", "Reject All", "Only necessary", "Necessary only",
                  "Reject", "Decline", "Deny"]

# Filled by the CDP network handler with sell-orders request URLs the page fires.
_seen_urls: list[str] = []


def looks_like_challenge(body: str) -> bool:
    # Normalise first so a None body cannot break the substring checks below.
    body = body or ""
    s = body.lstrip()
    return not s or s.startswith("<") or "Just a moment" in body or "challenge-platform" in body


def decimals(v: float) -> int:
    r = repr(float(v))
    return len(r.split(".")[-1]) if "." in r else 0


def template_from(url: str) -> str:
    """Turn a captured sell-orders URL into a template with offset as '{}',
    preserving every other param (the search/filter encoding we want to learn)."""
    parts = urlsplit(url)
    q = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "offset"]
    if not any(k == "limit" for k, _ in q):
        q.append(("limit", "60"))
    base_q = urlencode(q)
    new_q = (base_q + "&" if base_q else "") + "offset={}"
    return urlunsplit((parts.scheme, parts.netloc, parts.path, new_q, ""))


async def dismiss_consent(page) -> str | None:
    """Best-effort, privacy-preserving: never clicks 'Accept all'."""
    for label in CONSENT_LABELS:
        try:
            el = await page.find(label, best_match=True, timeout=2)
        except Exception:
            el = None
        if el:
            try:
                await el.click()
                return label
            except Exception:
                pass
    return None


async def fetch_json(page, url: str) -> tuple[str, str]:
    expr = (
        f"fetch({url!r}, {{credentials:'include', headers:{{'accept':'application/json'}}}})"
        f".then(async r => JSON.stringify({{status: r.status, body: await r.text()}}))"
    )
    raw = await page.evaluate(expr, await_promise=True)
    if not isinstance(raw, str):
        return ("-1", "")
    try:
        obj = json.loads(raw)
        return (str(obj.get("status", "-1")), obj.get("body", ""))
    except json.JSONDecodeError:
        return ("-1", raw)


async def main():
    OUT_DIR.mkdir(exist_ok=True)
    args = [f"--proxy-server={PROXY}"] if PROXY else []

    target_url = MARKET_URL
    tag = "market"
    if SEARCH:
        sep = "&" if "?" in MARKET_URL else "?"
        target_url = f"{MARKET_URL}{sep}search={quote_plus(SEARCH)}"
        tag = "search_" + "".join(c if c.isalnum() else "_" for c in SEARCH)[:40]

    print(f"Launching nodriver Chromium (proxy={PROXY or 'none / own IP'})...")
    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH, browser_args=args)

    pages_ok = items_total = floats_total = low_prec = 0
    dp_min, dp_max = 99, 0
    deepest_offset = None
    reason = "completed (hit PAGES limit)"

    try:
        # Open a blank tab first so the network handler is attached BEFORE the page
        # fires its filtered sell-orders request (otherwise we'd miss it).
        page = await browser.get("about:blank")

        async def on_request(evt):
            url = evt.request.url
            if "/market/sell-orders" in url:
                _seen_urls.append(url)

        page.add_handler(cdp.network.RequestWillBeSent, on_request)
        try:
            await page.send(cdp.network.enable())
        except Exception as ex:
            print(f"(network capture unavailable: {ex})")

        print(f"Opening {target_url}")
        await page.get(target_url)
        print(f"Solve any Cloudflare challenge. Waiting {SOLVE_SECONDS}s for the grid...")
        await page.sleep(SOLVE_SECONDS)

        clicked = await dismiss_consent(page)
        print(f"Consent banner: {'dismissed via ' + clicked if clicked else 'left up (does not block fetch)'}")

        # Reliable discovery via the Resource Timing API: the browser records EVERY
        # request the page made, so we read the real sell-orders URL straight out of it
        # (no flaky CDP event timing). Also dump nearby API calls for context.
        # cs.money is an Astro SSR app: the initial filtered listings are rendered
        # server-side (no client XHR to capture). Scroll to provoke lazy-load
        # pagination, which DOES fire a client request carrying the real filter params.
        print("Scrolling to trigger lazy-load pagination...")
        for _ in range(6):
            try:
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            except Exception:
                pass
            await page.sleep(2)

        # nodriver returns arrays unreliably from evaluate(), so JSON.stringify in JS
        # and json.loads here (the string path is proven by fetch_json).
        async def js_list(expr: str) -> list:
            raw = await page.evaluate(f"JSON.stringify({expr})")
            try:
                return json.loads(raw) if isinstance(raw, str) else []
            except (json.JSONDecodeError, TypeError):
                return []

        try:
            all_urls = await js_list("performance.getEntriesByType('resource').map(e=>e.name)")
            print(f">>> Resource Timing saw {len(all_urls)} requests total")
            if all_urls:
                (OUT_DIR / "_all_requests.txt").write_text(
                    "\n".join(dict.fromkeys(all_urls)), encoding="utf-8")
            sell = [u for u in all_urls if "/market/sell-orders" in u]
            _seen_urls.extend(sell)
            api = [u for u in all_urls if "cs.money/" in u and ("/2.0/" in u or "/1.0/" in u)]
            if api:
                (OUT_DIR / "_api_calls.txt").write_text("\n".join(dict.fromkeys(api)), encoding="utf-8")
                print(f">>> {len(set(api))} cs.money API calls; saved to {OUT_DIR / '_api_calls.txt'}")
        except Exception as ex:
            print(f"(resource-timing query failed: {ex})")

        # Dump the SSR'd page so we can see how the filter is encoded and where the
        # listings data lives (Astro embeds island props / hydration JSON in the HTML).
        try:
            html = await page.evaluate("document.documentElement.outerHTML")
            if isinstance(html, str) and html:
                (OUT_DIR / "_page.html").write_text(html, encoding="utf-8")
                print(f">>> saved page HTML ({len(html)} bytes) to {OUT_DIR / '_page.html'}")
        except Exception as ex:
            print(f"(page HTML dump failed: {ex})")

        # Discovery: what sell-orders request did the page actually make?
        if _seen_urls:
            captured = _seen_urls[-1]
            template = template_from(captured)
            print("\n>>> DISCOVERED sell-orders API call the page fired:")
            print(f"    {captured}")
            print(f">>> pagination template: {template}\n")
            # Persist it: the console line is easy to lose, and this is the one bit
            # of ground truth (the real filter-param scheme) we need.
            (OUT_DIR / "_discovered.txt").write_text(
                "ALL captured sell-orders requests:\n"
                + "\n".join(dict.fromkeys(_seen_urls))
                + f"\n\npagination template:\n{template}\n",
                encoding="utf-8")
            print(f">>> saved to {OUT_DIR / '_discovered.txt'}")
        else:
            template = DEFAULT_TEMPLATE
            if SEARCH:
                template = template.replace("offset={}", f"search={quote_plus(SEARCH)}&offset={{}}")
            print(f"\n(no request captured; falling back to template: {template})\n")

        for i in range(PAGES):
            offset = START_OFFSET + i * 60
            status, body = await fetch_json(page, template.format(offset))

            if looks_like_challenge(body):
                print(f"  page {i + 1} [offset {offset}]: RE-CHALLENGED (status {status}). Stopping.")
                (OUT_DIR / f"{tag}_challenge_offset_{offset}.html").write_text(body, encoding="utf-8")
                reason = f"re-challenged at offset {offset}"
                break

            try:
                items = json.loads(body).get("items", [])
            except json.JSONDecodeError:
                print(f"  page {i + 1} [offset {offset}]: non-JSON (status {status}). Stopping.")
                reason = f"non-JSON at offset {offset}"
                break

            if not items:
                print(f"  page {i + 1} [offset {offset}]: 0 items - end of results.")
                reason = "end of results"
                break

            (OUT_DIR / f"{tag}_offset_{offset:06d}.json").write_text(body, encoding="utf-8")
            pages_ok += 1
            deepest_offset = offset
            items_total += len(items)
            names = set()
            for it in items:
                fl = it.get("asset", {}).get("float")
                if fl is not None:
                    floats_total += 1
                    d = decimals(fl)
                    dp_min, dp_max = min(dp_min, d), max(dp_max, d)
                    if d <= 6:  # short repr: exact binary fraction (e.g. 1/16), not truncation
                        low_prec += 1
                names.add(it.get("asset", {}).get("names", {}).get("full"))
            sample = next(iter(names), None) if SEARCH else None
            print(f"  page {i + 1} [offset {offset}] OK - {len(items)} items"
|
||||
+ (f" (e.g. {sample}; {len(names)} distinct names)" if SEARCH else ""))
|
||||
|
||||
await page.sleep(DELAY + random.uniform(0, JITTER))
|
||||
|
||||
print("\n=== summary ===")
|
||||
print(f" query: {SEARCH or '(whole market)'}")
|
||||
print(f" stopped: {reason}")
|
||||
print(f" clean pages: {pages_ok} deepest offset: {deepest_offset} items: {items_total}")
|
||||
if floats_total:
|
||||
# Truncation would make MANY values short, not one exact binary fraction.
|
||||
verdict = "FULL precision" if low_prec / floats_total < 0.02 else "POSSIBLE TRUNCATION"
|
||||
print(f" floats: {floats_total} items, {dp_max}-decimal max, "
|
||||
f"{low_prec} short-repr (exact fractions) — {verdict}")
|
||||
print(f" files in {OUT_DIR}")
|
||||
finally:
|
||||
browser.stop()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
uc.loop().run_until_complete(main())
|
||||
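The summary above classifies float precision by counting how many values have a short decimal representation. A minimal sketch of that heuristic follows. The `decimals` helper is defined earlier in the script and is not shown in this part of the diff, so the implementation here is an assumption, and `precision_verdict` is a hypothetical name introduced for illustration.

```python
from decimal import Decimal


def decimals(value: float) -> int:
    """Digits after the decimal point in Python's shortest round-trip repr.

    Assumed implementation; the script's own helper may differ.
    """
    exponent = Decimal(repr(value)).as_tuple().exponent
    return max(0, -exponent)


def precision_verdict(floats, short_max=6, tolerance=0.02):
    """Mirror the summary logic: a few short values are exact binary
    fractions (e.g. 1/16); many short values suggest truncation."""
    if not floats:
        return "no floats"
    short = sum(1 for f in floats if decimals(f) <= short_max)
    return "FULL precision" if short / len(floats) < tolerance else "POSSIBLE TRUNCATION"


full = [0.07123456789012345] * 99 + [0.0625]   # one exact fraction among 100
rounded = [0.07, 0.08, 0.09]                   # every value is short
print(precision_verdict(full))
print(precision_verdict(rounded))
```

The 2% tolerance and the 6-digit threshold are taken from the script above. The sample values are made up.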
77
worker/probe_filters.py
Normal file
@@ -0,0 +1,77 @@
|
||||
"""
|
||||
Probe which extra filter params cs.money's SSR market search honors, so we can
|
||||
pick a SECOND pagination axis to break apart dense price bands that saturate the
|
||||
60-cap (see diag_windows.py). For a saturating search we try candidate params and
|
||||
report how the returned set's size + float range + price range change.
|
||||
|
||||
python probe_filters.py "Glock-18 Candy Apple mw"
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import sys
|
||||
|
||||
import nodriver as uc
|
||||
|
||||
import worker
|
||||
|
||||
BASE = "https://cs.money/market/buy/?search={q}"
|
||||
# (label, extra query string) — candidates cs.money markets commonly expose.
|
||||
CANDIDATES = [
|
||||
("baseline", ""),
|
||||
("sort=price asc", "&order=asc&sort=price"),
|
||||
("sort=price desc", "&order=desc&sort=price"),
|
||||
("sort=float", "&sort=float"),
|
||||
("minFloat/maxFloat lo", "&minFloat=0.07&maxFloat=0.10"),
|
||||
("minFloat/maxFloat hi", "&minFloat=0.10&maxFloat=0.15"),
|
||||
("minWear/maxWear lo", "&minWear=0.07&maxWear=0.10"),
|
||||
("isStatTrak=true", "&isStatTrak=true"),
|
||||
("hasStickers=false", "&hasStickers=false"),
|
||||
]
|
||||
|
||||
|
||||
def stats(items):
|
||||
floats = [(((it.get("asset") or {}).get("float"))) for it in items]
|
||||
floats = [f for f in floats if isinstance(f, (int, float))]
|
||||
bases = []
|
||||
for it in items:
|
||||
p = it.get("pricing") or {}
|
||||
b = p.get("basePrice", p.get("computed"))
|
||||
if isinstance(b, (int, float)):
|
||||
bases.append(b)
|
||||
fr = f"[{min(floats):.4f},{max(floats):.4f}]" if floats else "[-]"
|
||||
br = f"[{min(bases):.2f},{max(bases):.2f}]" if bases else "[-]"
|
||||
return f"n={len(items):3d} float{fr} base{br}"
|
||||
|
||||
|
||||
async def main():
|
||||
search = " ".join(sys.argv[1:]) or "Glock-18 Candy Apple mw"
|
||||
q = worker.urllib.parse.quote_plus(search)
|
||||
|
||||
args = ["--blink-settings=imagesEnabled=false"]
|
||||
browser = await uc.start(headless=False, browser_args=args)
|
||||
try:
|
||||
page = await browser.get("about:blank")
|
||||
await worker.warm(page)
|
||||
|
||||
base_ids = None
|
||||
for label, extra in CANDIDATES:
|
||||
url = BASE.format(q=q) + extra
|
||||
status, body = await worker.fetch_json(page, url)
|
||||
if "Just a moment" in body or "challenge-platform" in body:
|
||||
print(f" {label:24s} CHALLENGED"); break
|
||||
items = worker.extract_items(body)
|
||||
ids = {it.get("id") for it in items}
|
||||
if label == "baseline":
|
||||
base_ids = ids
|
||||
delta = ""
|
||||
else:
|
||||
# If a param is IGNORED, the set is identical to baseline.
|
||||
delta = "IGNORED (== baseline)" if ids == base_ids else f"CHANGED ({len(ids ^ (base_ids or set()))} diff ids)"
|
||||
print(f" {label:24s} {stats(items)} {delta}")
|
||||
await page.sleep(worker.DELAY)
|
||||
finally:
|
||||
browser.stop()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
uc.loop().run_until_complete(main())
|
||||
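The IGNORED/CHANGED verdict in probe_filters.py comes down to comparing id sets against the baseline. A small sketch of that comparison follows, with a hypothetical `classify` helper and made-up ids. One caveat applies under this assumption: a parameter that is honored but happens to return the same first page as the baseline is also reported as IGNORED, so an IGNORED result is a hint, not proof.

```python
def classify(baseline_ids: set, candidate_ids: set) -> str:
    """Label a candidate filter by how its id set differs from the baseline."""
    if candidate_ids == baseline_ids:
        return "IGNORED (== baseline)"
    # Symmetric difference counts ids present in exactly one of the two sets.
    return f"CHANGED ({len(candidate_ids ^ baseline_ids)} diff ids)"


baseline = {101, 102, 103}
print(classify(baseline, {101, 102, 103}))
print(classify(baseline, {102, 103, 104}))
```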
5
worker/requirements.txt
Normal file
@@ -0,0 +1,5 @@
|
||||
# cs.money scraping worker.
|
||||
# nodriver = the modern successor to undetected-chromedriver: it drives a normal
|
||||
# Chromium over CDP directly (no chromedriver, so none of the cdc_/webdriver tells
|
||||
# that got our .NET Selenium setup insta-challenged by Cloudflare).
|
||||
nodriver>=0.39
|
||||
77
worker/verify_count.py
Normal file
@@ -0,0 +1,77 @@
|
||||
"""
|
||||
One-off count verification: scrape a single skin+wear search from cs.money and
|
||||
report how many distinct sell-orders come back, reusing the production worker's
|
||||
warm-session + float-cursor pagination logic (worker.scrape_job).
|
||||
|
||||
Use it to sanity-check that our pagination actually recovers the FULL listing
|
||||
count cs.money shows on the site (the known ground truth) for one query.
|
||||
|
||||
cd worker
|
||||
.venv\\Scripts\\Activate.ps1
|
||||
python verify_count.py "Desert Eagle Bronze Deco fn"
|
||||
|
||||
Env knobs (same meaning as worker.py): SOLVE_SECONDS, DELAY, JITTER, PROXY,
|
||||
BROWSER_PATH, LOAD_IMAGES, CHROME_NO_SANDBOX. MAX_FETCHES caps page fetches (default 80).
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
import sys
|
||||
from collections import Counter
|
||||
|
||||
import nodriver as uc
|
||||
|
||||
import worker
|
||||
|
||||
MAX_FETCHES = int(os.environ.get("MAX_FETCHES", "80"))
|
||||
|
||||
|
||||
async def main():
|
||||
search = " ".join(sys.argv[1:]) or "Desert Eagle Bronze Deco fn"
|
||||
|
||||
args = [f"--proxy-server={worker.PROXY}"] if worker.PROXY else []
|
||||
if not worker.LOAD_IMAGES:
|
||||
args.append("--blink-settings=imagesEnabled=false")
|
||||
if os.environ.get("CHROME_NO_SANDBOX") == "1":
|
||||
args += ["--no-sandbox", "--disable-dev-shm-usage"]
|
||||
|
||||
print(f"Verifying count for search {search!r} (proxy={worker.PROXY or 'own IP'})")
|
||||
browser = await uc.start(
|
||||
headless=False, browser_executable_path=worker.BROWSER_PATH, browser_args=args)
|
||||
try:
|
||||
page = await browser.get("about:blank")
|
||||
await worker.warm(page)
|
||||
|
||||
job = {"search": search, "maxPages": MAX_FETCHES}
|
||||
items, fetches, reason = await worker.scrape_job(page, job)
|
||||
|
||||
print("\n=== result ===")
|
||||
print(f" search: {search}")
|
||||
print(f" stopped: {reason}")
|
||||
print(f" fetches: {fetches}")
|
||||
print(f" DISTINCT sell-orders (deduped by id): {len(items)}")
|
||||
|
||||
# Break down what came back so we can see whether the count is inflated by
|
||||
# off-target names/wears (the C2's name+wear filter would drop those later).
|
||||
names = Counter()
|
||||
wears = Counter()
|
||||
st = 0
|
||||
for it in items:
|
||||
asset = it.get("asset") or {}
|
||||
names[(asset.get("names") or {}).get("full")] += 1
|
||||
wears[asset.get("quality")] += 1
|
||||
if asset.get("isStatTrak"):
|
||||
st += 1
|
||||
print(f" StatTrak in set: {st}")
|
||||
print(" by name:")
|
||||
for name, n in names.most_common():
|
||||
print(f" {n:4d} {name}")
|
||||
print(" by wear (quality code):")
|
||||
for w, n in wears.most_common():
|
||||
print(f" {n:4d} {w}")
|
||||
finally:
|
||||
browser.stop()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
uc.loop().run_until_complete(main())
|
||||
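The breakdown at the end of verify_count.py can be tried without a browser. The sketch below wraps the same counting in a hypothetical `breakdown` function and runs it on made-up items. The item shape (`asset.names.full`, `asset.quality`, `asset.isStatTrak`) follows the fields the script reads, and the quality codes "fn" and "mw" are assumptions.

```python
from collections import Counter


def breakdown(items):
    """Count listings by full name and wear code, and tally StatTrak items."""
    names, wears, stattrak = Counter(), Counter(), 0
    for it in items:
        asset = it.get("asset") or {}          # tolerate a missing/None asset
        names[(asset.get("names") or {}).get("full")] += 1
        wears[asset.get("quality")] += 1
        if asset.get("isStatTrak"):
            stattrak += 1
    return names, wears, stattrak


FN = "Desert Eagle | Bronze Deco (Factory New)"
MW = "Desert Eagle | Bronze Deco (Minimal Wear)"
sample = [
    {"id": 1, "asset": {"names": {"full": FN}, "quality": "fn", "isStatTrak": False}},
    {"id": 2, "asset": {"names": {"full": FN}, "quality": "fn", "isStatTrak": True}},
    {"id": 3, "asset": {"names": {"full": MW}, "quality": "mw", "isStatTrak": False}},
    {"id": 4, "asset": None},
]
names, wears, stattrak = breakdown(sample)
for name, n in names.most_common():
    print(f"{n:4d} {name}")
```

Items with no asset are counted under `None`, which makes malformed entries visible instead of dropping them.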
79
worker/verify_crosscheck.py
Normal file
@@ -0,0 +1,79 @@
|
||||
"""
|
||||
Validate the float-cursor scrape by walking the float axis in BOTH directions and
|
||||
comparing the recovered sell-order id sets. If ascending (lowest float first) and
|
||||
descending (highest float first) independently land on the same listings, the
|
||||
cursor is exhaustive and order-independent — i.e. the count is real, not an artifact
|
||||
of walk direction or boundary double-counting.
|
||||
|
||||
python verify_crosscheck.py "Glock-18 Candy Apple mw"
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import sys
|
||||
|
||||
import nodriver as uc
|
||||
|
||||
import worker
|
||||
|
||||
CAP = worker.PAGE_CAP
|
||||
ASC = ("https://cs.money/market/buy/?search={q}"
|
||||
"&order=asc&sort=float&minFloat={cur:.12f}&maxFloat=1")
|
||||
DESC = ("https://cs.money/market/buy/?search={q}"
|
||||
"&order=desc&sort=float&minFloat=0&maxFloat={cur:.12f}")
|
||||
|
||||
|
||||
async def walk(page, q, template, ascending, max_fetches=60):
|
||||
seen = {}
|
||||
cur = 0.0 if ascending else 1.0
|
||||
fetches = 0
|
||||
while fetches < max_fetches:
|
||||
status, body = await worker.fetch_json(page, template.format(q=q, cur=cur))
|
||||
fetches += 1
|
||||
if "Just a moment" in body or "challenge-platform" in body:
|
||||
return seen, fetches, "challenged"
|
||||
items = worker.extract_items(body)
|
||||
floats = []
|
||||
for it in items:
|
||||
if it.get("id") is not None:
|
||||
seen[it["id"]] = it
|
||||
fl = (it.get("asset") or {}).get("float")
|
||||
if isinstance(fl, (int, float)):
|
||||
floats.append(fl)
|
||||
if len(items) < CAP:
|
||||
return seen, fetches, "completed"
|
||||
nxt = (max(floats) if ascending else min(floats)) if floats else None
|
||||
if nxt is None or (ascending and nxt <= cur) or (not ascending and nxt >= cur):
|
||||
return seen, fetches, "stuck"
|
||||
cur = nxt
|
||||
await page.sleep(worker.DELAY)
|
||||
return seen, fetches, "fetch-cap"
|
||||
|
||||
|
||||
async def main():
|
||||
search = " ".join(sys.argv[1:]) or "Glock-18 Candy Apple mw"
|
||||
q = worker.urllib.parse.quote_plus(search)
|
||||
browser = await uc.start(headless=False, browser_args=["--blink-settings=imagesEnabled=false"])
|
||||
try:
|
||||
page = await browser.get("about:blank")
|
||||
await worker.warm(page)
|
||||
|
||||
asc, fa, ra = await walk(page, q, ASC, ascending=True)
|
||||
print(f"ASC : {len(asc):4d} ids {fa} fetches {ra}")
|
||||
desc, fd, rd = await walk(page, q, DESC, ascending=False)
|
||||
print(f"DESC: {len(desc):4d} ids {fd} fetches {rd}")
|
||||
|
||||
a, d = set(asc), set(desc)
|
||||
union = a | d
|
||||
print("\n=== cross-check ===")
|
||||
print(f" ASC only: {len(a - d)}")
|
||||
print(f" DESC only: {len(d - a)}")
|
||||
print(f" in both: {len(a & d)}")
|
||||
print(f" UNION (distinct):{len(union)}")
|
||||
agree = "AGREE — count is solid" if a == d else "DISAGREE — one walk missed listings"
|
||||
print(f" verdict: {agree}")
|
||||
finally:
|
||||
browser.stop()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
uc.loop().run_until_complete(main())
|
||||
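The two-direction walk can be exercised offline against an in-memory stand-in for the search page. This is a sketch under stated assumptions: `fake_page` is hypothetical and models a server that filters by an inclusive float range, sorts by float and caps the page; the cap is 3 here only to keep the example small.

```python
CAP = 3  # tiny page size for illustration; the real pages cap at 60


def fake_page(listings, lo, hi, ascending):
    """Stand-in for one search page: inclusive range filter, sort, cap."""
    hits = [it for it in listings if lo <= it["float"] <= hi]
    hits.sort(key=lambda it: it["float"], reverse=not ascending)
    return hits[:CAP]


def walk(listings, ascending, max_fetches=50):
    """Cursor walk along the float axis; returns (seen-by-id, stop reason)."""
    seen = {}
    cur = 0.0 if ascending else 1.0
    for _ in range(max_fetches):
        if ascending:
            page = fake_page(listings, cur, 1.0, True)
        else:
            page = fake_page(listings, 0.0, cur, False)
        for it in page:
            seen[it["id"]] = it            # boundary items dedupe by id
        if len(page) < CAP:
            return seen, "completed"
        floats = [it["float"] for it in page]
        nxt = max(floats) if ascending else min(floats)
        if (ascending and nxt <= cur) or (not ascending and nxt >= cur):
            return seen, "stuck"           # cursor cannot move past a tie
        cur = nxt
    return seen, "fetch-cap"


listings = [{"id": i, "float": round(0.07 + i * 0.003, 6)} for i in range(10)]
asc, asc_reason = walk(listings, ascending=True)
desc, desc_reason = walk(listings, ascending=False)
print(len(asc), asc_reason, len(desc), desc_reason, set(asc) == set(desc))

# More than CAP listings on one float value: the cursor cannot advance.
tie = [{"id": i, "float": 0.1} for i in range(4)]
tie_seen, tie_reason = walk(tie, ascending=True)
print(len(tie_seen), tie_reason)
```

With distinct floats both directions recover the same ten ids. The tie case shows why the walk reports "stuck" instead of looping.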
453
worker/worker.py
Normal file
@@ -0,0 +1,453 @@
|
||||
"""
|
||||
cs.money scrape worker (pull model).
|
||||
|
||||
Holds ONE warm nodriver session (the thing that beats Cloudflare), then loops:
|
||||
poll the .NET C2 for a job, scrape that skin+wear's sell-orders via in-page fetch
|
||||
from the cleared session, and post the results back. The C2 owns job selection
|
||||
(stalest skin+wear first) and persistence; this worker just fetches and forwards.
|
||||
|
||||
cd worker
|
||||
.venv\\Scripts\\Activate.ps1
|
||||
pip install -r requirements.txt
|
||||
python worker.py
|
||||
|
||||
Env knobs:
|
||||
C2_URL C2 base URL (default http://localhost:5080)
|
||||
WORKER_TOKEN shared secret, must match the C2's WorkerToken (default dev-worker-token)
|
||||
MARKET_URL market page to warm the session on (default the buy market)
|
||||
SOLVE_SECONDS seconds to clear Cloudflare on startup (default 30)
|
||||
DELAY / JITTER base + random seconds between page fetches (default 2.0 / 1.5)
|
||||
IDLE_SECONDS sleep when the C2 has no work (default 10)
|
||||
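BROWSER_PATH path to Chrome/Edge if auto-detect fails
LOAD_IMAGES set to 1 to load images (default off to save proxy bandwidth)
CHROME_NO_SANDBOX set to 1 to add --no-sandbox (needed when Chromium runs as root in a container)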
|
||||
|
||||
Proxy (pick one; IPRoyal takes priority when its creds are set):
|
||||
IPROYAL_USERNAME IPRoyal residential account username
|
||||
IPROYAL_PASSWORD IPRoyal residential account password
|
||||
IPROYAL_COUNTRY ISO country for the exit (default us; blank = any)
|
||||
IPROYAL_LIFETIME_MIN sticky-IP hold in minutes (default 60)
|
||||
PROXY host:port for an auth-free proxy (fallback; omit to use your own IP)
|
||||
|
||||
Each worker process mints its own random IPRoyal sticky session at startup, so N
|
||||
workers get N distinct residential exit IPs with no coordination — scale with
|
||||
`docker compose up --scale worker=N`. On a Cloudflare challenge the worker rotates
|
||||
to a fresh session (new IP) and re-warms. Chromium can't carry proxy credentials on
|
||||
--proxy-server, so we run a tiny in-process forwarder (LocalForwardingProxy below)
|
||||
that injects the IPRoyal auth and chains to the gateway; Chrome talks only to an
|
||||
auth-free 127.0.0.1 endpoint, keeping us at zero CDP (a CDP auth handler is a
|
||||
Cloudflare tell).
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import base64
|
||||
import json
|
||||
import os
|
||||
import random
|
||||
import re
|
||||
import urllib.error
|
||||
import urllib.parse
|
||||
import urllib.request
|
||||
import uuid
|
||||
|
||||
import nodriver as uc
|
||||
|
||||
C2_URL = os.environ.get("C2_URL", "http://localhost:5080").rstrip("/")
|
||||
TOKEN = os.environ.get("WORKER_TOKEN", "dev-worker-token")
|
||||
MARKET_URL = os.environ.get("MARKET_URL", "https://cs.money/market/buy/")
|
||||
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
|
||||
DELAY = float(os.environ.get("DELAY", "2.0"))
|
||||
JITTER = float(os.environ.get("JITTER", "1.5"))
|
||||
IDLE_SECONDS = int(os.environ.get("IDLE_SECONDS", "10"))
|
||||
PROXY = os.environ.get("PROXY")
|
||||
BROWSER_PATH = os.environ.get("BROWSER_PATH")
|
||||
|
||||
# IPRoyal residential gateway. One fixed host/port; country, sticky-session id and
|
||||
# lifetime are encoded as underscore params appended to the password (see
|
||||
# _iproyal_password). Mirrors the .NET IpRoyalProxyProvider scheme.
|
||||
IPROYAL_HOST = os.environ.get("IPROYAL_HOST", "geo.iproyal.com")
|
||||
IPROYAL_PORT = int(os.environ.get("IPROYAL_PORT", "12321"))
|
||||
IPROYAL_USERNAME = os.environ.get("IPROYAL_USERNAME")
|
||||
IPROYAL_PASSWORD = os.environ.get("IPROYAL_PASSWORD")
|
||||
IPROYAL_COUNTRY = os.environ.get("IPROYAL_COUNTRY", "us").strip().lower()
|
||||
IPROYAL_LIFETIME_MIN = int(os.environ.get("IPROYAL_LIFETIME_MIN", "60"))
|
||||
# Residential proxy is metered per GB. Cloudflare gates on JS, not images, and the
|
||||
# sell-orders API is pure JSON — so block images by default to slash page-render
|
||||
# bandwidth. Set LOAD_IMAGES=1 to re-enable (e.g. for debugging the visible page).
|
||||
LOAD_IMAGES = os.environ.get("LOAD_IMAGES") == "1"
|
||||
|
||||
# cs.money is an Astro SSR app: the free-text market search filters server-side and
|
||||
# the resulting listings are embedded in the page as a __page-params JSON blob. The
|
||||
# /2.0/market/sell-orders API rejects a `search` param (HTTP 400), so we fetch the
|
||||
# PAGE for a search and read the embedded items — same item shape as the API.
|
||||
#
|
||||
# A page returns at most 60 and offset is ignored, so we paginate with a FORWARD
|
||||
# CURSOR on float: cs.money honors `order=asc&sort=float` + `minFloat`, and float is
|
||||
# full-precision and effectively unique per item. We grab the 60 lowest-float items
|
||||
# at/above `lo`, advance `lo` to the highest float returned, and repeat until a page
|
||||
# is under the cap. (The old minPrice/maxPrice bisection silently truncated cheap
|
||||
# skins: >60 listings can share a sub-$0.02 reference band, which no price window can
|
||||
# split. Floats almost never tie, so the cursor nearly always advances; if more than
# 60 listings do share one float, scrape_job stops with "stuck-float-tie".)
|
||||
PAGE = ("https://cs.money/market/buy/?search={search}"
|
||||
"&order=asc&sort=float&minFloat={lo:.12f}&maxFloat=1")
|
||||
PAGE_CAP = 60 # items per SSR page
|
||||
PAGE_PARAMS_RE = re.compile(
|
||||
r'<script\b[^>]*id="__page-params"[^>]*>(.*?)</script>', re.S)
|
||||
|
||||
|
||||
# --- IPRoyal residential proxy ----------------------------------------------------
|
||||
|
||||
def _new_session_id() -> str:
|
||||
"""Short, opaque, URL-safe token. IPRoyal pins one residential exit IP per
|
||||
distinct session value, so a fresh id == a fresh IP."""
|
||||
return uuid.uuid4().hex[:10]
|
||||
|
||||
|
||||
def _iproyal_password(session_id: str) -> str:
|
||||
"""Bake the targeting/session knobs onto the account password, IPRoyal-style:
|
||||
"<pass>_country-us_session-<id>_lifetime-60m". Country is optional."""
|
||||
pw = IPROYAL_PASSWORD
|
||||
if IPROYAL_COUNTRY:
|
||||
pw += f"_country-{IPROYAL_COUNTRY}"
|
||||
pw += f"_session-{session_id}_lifetime-{IPROYAL_LIFETIME_MIN}m"
|
||||
return pw
|
||||
|
||||
|
||||
class LocalForwardingProxy:
|
||||
"""In-process HTTP proxy on 127.0.0.1 that chains every connection to the IPRoyal
|
||||
gateway, injecting the Proxy-Authorization header itself. Chromium ignores creds in
|
||||
--proxy-server and the in-browser ways to answer the gateway's 407 (a CDP auth
|
||||
handler, or a disabled MV2 extension) are Cloudflare tells — so we terminate the
|
||||
browser->proxy hop locally and add auth here, leaving Chrome to talk to an auth-free
|
||||
endpoint at zero CDP. HTTPS (all cs.money serves) flows through the CONNECT tunnel,
|
||||
so this proxy only relays ciphertext and never sees plaintext. Ported from the .NET
|
||||
LocalForwardingProxy. The active session token can be swapped live (set_password) to
|
||||
move to a fresh exit IP without restarting the browser. (New tunnels pick up the new
|
||||
IP; any still-open keep-alive tunnel stays on the old one until it closes.)"""
|
||||
|
||||
def __init__(self, host: str, port: int, username: str, password: str):
|
||||
self._host = host
|
||||
self._port = port
|
||||
self._username = username
|
||||
self._password = password
|
||||
self._server: asyncio.AbstractServer | None = None
|
||||
self.endpoint = ""
|
||||
|
||||
def set_password(self, password: str) -> None:
|
||||
self._password = password
|
||||
|
||||
def _auth_header(self) -> str:
|
||||
token = base64.b64encode(f"{self._username}:{self._password}".encode()).decode()
|
||||
return f"Proxy-Authorization: Basic {token}\r\n"
|
||||
|
||||
async def start(self) -> "LocalForwardingProxy":
|
||||
self._server = await asyncio.start_server(self._handle, "127.0.0.1", 0)
|
||||
port = self._server.sockets[0].getsockname()[1]
|
||||
self.endpoint = f"127.0.0.1:{port}"
|
||||
return self
|
||||
|
||||
async def stop(self) -> None:
|
||||
if self._server is not None:
|
||||
self._server.close()
|
||||
try:
|
||||
await self._server.wait_closed()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
@staticmethod
|
||||
async def _read_header(reader: asyncio.StreamReader) -> str | None:
|
||||
"""Read up to the end of the HTTP header block (CRLFCRLF). None on EOF/overflow."""
|
||||
try:
|
||||
data = await reader.readuntil(b"\r\n\r\n")
|
||||
except (asyncio.IncompleteReadError, asyncio.LimitOverrunError):
|
||||
return None
|
||||
return data.decode("latin-1")
|
||||
|
||||
async def _handle(self, client_reader: asyncio.StreamReader, client_writer: asyncio.StreamWriter) -> None:
|
||||
up_writer: asyncio.StreamWriter | None = None
|
||||
try:
|
||||
header = await self._read_header(client_reader)
|
||||
if not header:
|
||||
return
|
||||
parts = header.split("\r\n", 1)[0].split(" ")
|
||||
if len(parts) < 2:
|
||||
return
|
||||
method, target = parts[0], parts[1]
|
||||
|
||||
up_reader, up_writer = await asyncio.open_connection(self._host, self._port)
|
||||
if method.upper() == "CONNECT":
|
||||
# HTTPS: open an authenticated tunnel upstream, then relay raw bytes.
|
||||
up_writer.write(
|
||||
f"CONNECT {target} HTTP/1.1\r\nHost: {target}\r\n{self._auth_header()}\r\n".encode())
|
||||
await up_writer.drain()
|
||||
up_header = await self._read_header(up_reader)
|
||||
status = up_header.split(" ", 2) if up_header else []
|
||||
if len(status) < 2 or status[1] != "200":
|
||||
line = (up_header or "no response").split("\r\n", 1)[0]
|
||||
print(f" proxy: upstream refused CONNECT {target}: {line}")
|
||||
client_writer.write(b"HTTP/1.1 502 Bad Gateway\r\nConnection: close\r\n\r\n")
|
||||
await client_writer.drain()
|
||||
return
|
||||
client_writer.write(b"HTTP/1.1 200 Connection established\r\n\r\n")
|
||||
await client_writer.drain()
|
||||
else:
|
||||
# Plain HTTP: re-inject the request upstream with auth, then relay.
|
||||
idx = header.index("\r\n") + 2
|
||||
up_writer.write((header[:idx] + self._auth_header() + header[idx:]).encode())
|
||||
await up_writer.drain()
|
||||
|
||||
await self._relay(client_reader, client_writer, up_reader, up_writer)
|
||||
except Exception:
|
||||
pass # one bad tunnel must never take down the listener
|
||||
finally:
|
||||
for w in (client_writer, up_writer):
|
||||
if w is not None:
|
||||
try:
|
||||
w.close()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
@staticmethod
|
||||
async def _relay(
|
||||
client_reader: asyncio.StreamReader, client_writer: asyncio.StreamWriter,
|
||||
up_reader: asyncio.StreamReader, up_writer: asyncio.StreamWriter) -> None:
|
||||
async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
|
||||
try:
|
||||
while data := await reader.read(65536):
|
||||
writer.write(data)
|
||||
await writer.drain()
|
||||
except Exception:
|
||||
pass
|
||||
await asyncio.gather(
|
||||
pipe(client_reader, up_writer),
|
||||
pipe(up_reader, client_writer),
|
||||
)
|
||||
|
||||
|
||||
def looks_like_challenge(body: str) -> bool:
|
||||
s = (body or "").lstrip()
|
||||
return not s or s.startswith("<") or "Just a moment" in body or "challenge-platform" in body
|
||||
|
||||
|
||||
# --- C2 HTTP (stdlib, run off the event loop) -------------------------------------
|
||||
|
||||
def _get_job_sync():
|
||||
req = urllib.request.Request(f"{C2_URL}/jobs/next", headers={"X-Worker-Token": TOKEN})
|
||||
try:
|
||||
with urllib.request.urlopen(req, timeout=15) as r:
|
||||
if r.status == 204:
|
||||
return None
|
||||
return json.loads(r.read() or b"null")
|
||||
except urllib.error.HTTPError as e:
|
||||
print(f" C2 /jobs/next -> HTTP {e.code}")
|
||||
return None
|
||||
except urllib.error.URLError as e:
|
||||
print(f" C2 unreachable: {e}")
|
||||
return None
|
||||
|
||||
|
||||
def _post_result_sync(job_id: str, payload: dict):
|
||||
data = json.dumps(payload).encode()
|
||||
req = urllib.request.Request(
|
||||
f"{C2_URL}/jobs/{job_id}/result", data=data, method="POST",
|
||||
headers={"X-Worker-Token": TOKEN, "Content-Type": "application/json"})
|
||||
try:
|
||||
with urllib.request.urlopen(req, timeout=60) as r:
|
||||
return json.loads(r.read() or b"null")
|
||||
except urllib.error.HTTPError as e:
|
||||
print(f" C2 result -> HTTP {e.code}: {e.read()[:200]!r}")
|
||||
return None
|
||||
except urllib.error.URLError as e:
|
||||
print(f" C2 unreachable posting result: {e}")
|
||||
return None
|
||||
|
||||
|
||||
async def get_job():
|
||||
return await asyncio.to_thread(_get_job_sync)
|
||||
|
||||
|
||||
async def post_result(job_id, payload):
|
||||
return await asyncio.to_thread(_post_result_sync, job_id, payload)
|
||||
|
||||
|
||||
# --- scraping ---------------------------------------------------------------------
|
||||
|
||||
async def fetch_json(page, url: str) -> tuple[str, str]:
|
||||
expr = (
|
||||
f"fetch({url!r}, {{credentials:'include', headers:{{'accept':'application/json'}}}})"
|
||||
f".then(async r => JSON.stringify({{status: r.status, body: await r.text()}}))"
|
||||
)
|
||||
raw = await page.evaluate(expr, await_promise=True)
|
||||
if not isinstance(raw, str):
|
||||
return ("-1", "")
|
||||
try:
|
||||
obj = json.loads(raw)
|
||||
return (str(obj.get("status", "-1")), obj.get("body", ""))
|
||||
except json.JSONDecodeError:
|
||||
return ("-1", raw)
|
||||
|
||||
|
||||
async def _click(page, text, timeout=3):
|
||||
try:
|
||||
el = await page.find(text, best_match=True, timeout=timeout)
|
||||
if el:
|
||||
await el.click()
|
||||
return True
|
||||
except Exception:
|
||||
pass
|
||||
return False
|
||||
|
||||
|
||||
async def dismiss_consent(page):
|
||||
"""Privacy-preserving. The banner only offers 'Accept all' / 'Manage cookies';
|
||||
the Reject-all control lives inside the Manage window. So: Manage -> Reject all ->
|
||||
Confirm. (The data path reads SSR __page-params regardless, but this keeps the
|
||||
session honest and unblocks any future interaction.)"""
|
||||
steps = []
|
||||
if await _click(page, "Manage cookies") or await _click(page, "Manage"):
|
||||
await page.sleep(1)
|
||||
if await _click(page, "Reject all"):
|
||||
steps.append("reject-all")
|
||||
for c in ("Confirm my choice", "Confirm", "Save"):
|
||||
if await _click(page, c):
|
||||
steps.append(f"confirm:{c}")
|
||||
break
|
||||
return ", ".join(steps) if steps else None
|
||||
|
||||
|
||||
async def warm(page):
|
||||
"""Open the market and clear Cloudflare so the session holds cf_clearance."""
|
||||
print(f"Warming session at {MARKET_URL} (clear Cloudflare; {SOLVE_SECONDS}s)...")
|
||||
await page.get(MARKET_URL)
|
||||
await page.sleep(SOLVE_SECONDS)
|
||||
clicked = await dismiss_consent(page)
|
||||
print(f"Consent: {'dismissed via ' + clicked if clicked else 'left up'}")
|
||||
|
||||
|
||||
def extract_items(html: str) -> list:
|
||||
"""Pull inventory.items out of the page's __page-params JSON blob."""
|
||||
m = PAGE_PARAMS_RE.search(html)
|
||||
if not m:
|
||||
return []
|
||||
try:
|
||||
return json.loads(m.group(1)).get("inventory", {}).get("items", []) or []
|
||||
except json.JSONDecodeError:
|
||||
return []
|
||||
|
||||
|
||||
async def scrape_job(page, job) -> tuple[list, int, str]:
|
||||
"""Scrape ALL listings for one skin+wear via a forward float cursor.
|
||||
|
||||
A search page returns at most 60 items and ignores offset, but cs.money sorts by
|
||||
float (order=asc&sort=float) and filters by minFloat. So we walk the float axis:
|
||||
grab the 60 lowest-float items at/above `lo`, advance `lo` to the highest float on
|
||||
the page, and repeat until a page is under the cap. The boundary item is re-fetched
|
||||
(minFloat is inclusive) and dropped by the id dedup. Returns (items, fetches, reason).
|
||||
"""
|
||||
search = urllib.parse.quote_plus(job["search"])
|
||||
max_fetches = job.get("maxPages", 40) # safety cap on page fetches per job
|
||||
seen: dict = {}
|
||||
fetches = 0
|
||||
lo = 0.0
|
||||
reason = "completed"
|
||||
|
||||
while fetches < max_fetches:
|
||||
status, body = await fetch_json(page, PAGE.format(search=search, lo=lo))
|
||||
fetches += 1
|
||||
|
||||
if "Just a moment" in body or "challenge-platform" in body:
|
||||
return list(seen.values()), fetches, "challenged"
|
||||
|
||||
items = extract_items(body)
|
||||
floats = []
|
||||
for it in items:
|
||||
if it.get("id") is not None:
|
||||
seen[it["id"]] = it
|
||||
fl = (it.get("asset") or {}).get("float")
|
||||
if isinstance(fl, (int, float)):
|
||||
floats.append(fl)
|
||||
|
||||
if len(items) < PAGE_CAP:
|
||||
break # last page — fewer than the cap means we've seen everything
|
||||
|
||||
# Advance the cursor past the highest float on this page. Items at exactly that
|
||||
# float are re-fetched next round (minFloat is inclusive) and deduped by id.
|
||||
nxt = max(floats) if floats else None
|
||||
if nxt is None or nxt <= lo:
|
||||
# Cursor can't advance: >60 listings share a single float value, or the
|
||||
# items carry no float. Bail loudly rather than spin — a flagged gap beats
|
||||
# a silent one (this is the failure the price-window version hid).
|
||||
reason = "stuck-float-tie"
|
||||
break
|
||||
lo = nxt
|
||||
|
||||
await page.sleep(DELAY + random.uniform(0, JITTER))
|
||||
else:
|
||||
reason = "fetch-cap"
|
||||
|
||||
return list(seen.values()), fetches, reason
|
||||
|
||||
|
||||
async def main():
|
||||
# IPRoyal (auth'd, per-worker sticky IP) takes priority; else a plain auth-free
|
||||
# PROXY; else this host's own IP. The forwarder injects IPRoyal auth so Chrome
|
||||
# only ever sees an auth-free 127.0.0.1 endpoint.
|
||||
forwarder = None
|
||||
session_id = None
|
||||
if IPROYAL_USERNAME and IPROYAL_PASSWORD:
|
||||
session_id = _new_session_id()
|
||||
forwarder = await LocalForwardingProxy(
|
||||
IPROYAL_HOST, IPROYAL_PORT, IPROYAL_USERNAME, _iproyal_password(session_id)).start()
|
||||
proxy = forwarder.endpoint
|
||||
proxy_label = f"iproyal[{IPROYAL_COUNTRY or 'any'}] session {session_id} via {forwarder.endpoint}"
|
||||
else:
|
||||
proxy = PROXY
|
||||
proxy_label = PROXY or "own IP"
|
||||
|
||||
args = [f"--proxy-server={proxy}"] if proxy else []
|
||||
if not LOAD_IMAGES:
|
||||
# Disable image loading at the engine level — the dominant bandwidth cost on
|
||||
# an image-heavy market, and unneeded for CF clearance or the JSON API.
|
||||
args.append("--blink-settings=imagesEnabled=false")
|
||||
if os.environ.get("CHROME_NO_SANDBOX") == "1":
|
||||
# Required when running Chromium as root in a container.
|
||||
args += ["--no-sandbox", "--disable-dev-shm-usage"]
|
||||
print(f"Starting worker (C2={C2_URL}, proxy={proxy_label}, images={'on' if LOAD_IMAGES else 'off'})...")
|
||||
browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH, browser_args=args)
|
||||
try:
|
||||
page = await browser.get("about:blank")
|
||||
await warm(page)
|
||||
|
||||
while True:
|
||||
job = await get_job()
|
||||
if not job:
|
||||
await asyncio.sleep(IDLE_SECONDS)
|
||||
continue
|
||||
|
||||
print(f"Job {job['jobId'][:8]} — search {job['search']!r}")
|
||||
items, pages, reason = await scrape_job(page, job)
|
||||
|
||||
if reason == "challenged":
|
||||
# The exit IP is likely flagged. On IPRoyal, rotate to a fresh sticky
|
||||
# session (new IP) before re-warming; otherwise just re-solve in place.
|
||||
if forwarder is not None:
|
||||
session_id = _new_session_id()
|
||||
forwarder.set_password(_iproyal_password(session_id))
|
||||
print(f" challenged; rotating exit IP -> session {session_id}, re-warming...")
|
||||
else:
|
||||
print(" re-challenged; re-warming session...")
|
||||
await warm(page)
|
||||
|
||||
result = await post_result(job["jobId"], {
|
||||
"items": items, "pages": pages, "stoppedReason": reason})
|
||||
summary = (f"matched {result.get('matched')}, new {result.get('inserted')}, "
|
||||
f"upd {result.get('updated')}, removed {result.get('removed')}") if result else "post failed"
|
||||
print(f" scraped {len(items)} items ({pages}p, {reason}) -> {summary}")
|
||||
|
||||
await page.sleep(DELAY + random.uniform(0, JITTER))
|
||||
finally:
|
||||
browser.stop()
|
||||
if forwarder is not None:
|
||||
await forwarder.stop()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
uc.loop().run_until_complete(main())
|
||||
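The proxy-auth path in worker.py has three small string steps: build the session password, encode the Proxy-Authorization header, and splice that header in after the request line. The sketch below restates them as pure functions so they can be checked without a network. The function names and credentials are made up, and the password layout is the one described in the `_iproyal_password` docstring, not something verified against IPRoyal's documentation.

```python
import base64


def iproyal_password(base, session_id, country="us", lifetime_min=60):
    """Append targeting/session parameters to the account password."""
    pw = base
    if country:                                   # blank country = any exit
        pw += f"_country-{country}"
    return pw + f"_session-{session_id}_lifetime-{lifetime_min}m"


def proxy_auth_header(username, password):
    """HTTP Basic credentials for the upstream proxy, as one header line."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Proxy-Authorization: Basic {token}\r\n"


def inject_auth(header_block, auth_line):
    """Insert auth_line right after the request line of a raw header block."""
    idx = header_block.index("\r\n") + 2
    return header_block[:idx] + auth_line + header_block[idx:]


pw = iproyal_password("secret", "abc123")
request = "GET http://example.com/ HTTP/1.1\r\nHost: example.com\r\n\r\n"
forwarded = inject_auth(request, proxy_auth_header("user", pw))
print(pw)
print(forwarded.split("\r\n")[1][:27])
```

Rotating the exit IP then amounts to calling `iproyal_password` with a new session id, which is what `set_password` receives in the worker.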