almost ready
@@ -18,13 +18,20 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
WORKDIR /app
COPY worker/requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY worker/worker.py worker/entrypoint.sh ./
# blworker/ is the shared package both market scripts import; ship it + the two thin
# market scripts + the entrypoint.
COPY worker/blworker ./blworker
COPY worker/csmoney_worker.py worker/skinland_worker.py worker/entrypoint.sh ./
RUN chmod +x entrypoint.sh

# Which worker this image runs (overridden per service in docker-compose). The cs.money
# worker is the default; the skin.land service sets WORKER_SCRIPT=skinland_worker.py.
ENV BROWSER_PATH=/usr/bin/chromium \
    CHROME_NO_SANDBOX=1 \
    DISPLAY=:99 \
    SOLVE_SECONDS=45 \
    WORKER_SCRIPT=csmoney_worker.py \
    LOG_JSON=1 \
    PYTHONUNBUFFERED=1

@@ -14,47 +14,27 @@ webdriver` and chromedriver `cdc_` artifacts that Cloudflare keys on. `nodriver`
drives a normal Chromium directly over CDP (no chromedriver) and patches those
tells, so it passes where Selenium loops.

## Step 1: prove it (current)

`poc.py` proves nodriver can clear cs.money's Cloudflare and fetch the listings API
before we build the full pull-based fleet.
## Local setup

```powershell
cd worker
py -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
python poc.py
```

A Chromium window opens on the market. Solve the Cloudflare check if shown; the
script waits, then pages `sell-orders` deeply (PAGES), reporting how far the warm
session survives before any re-challenge and confirming full float precision.
Output lands in `worker/captures/`.

**Targeted skin+wear search.** cs.money search is free-text on the page
(`?search=cyber+security+ft`). Set `SEARCH` and the PoC navigates there, **captures
the actual filtered `sell-orders` API request the page fires** (so we learn the real
filter params instead of guessing), prints it, then pages that filtered API:

```powershell
$env:SEARCH="cyber security ft"; python poc.py # FT M4A4 Cyber Security only
```

The `>>> DISCOVERED sell-orders API call` line shows how the search maps to API
params — that's how the C2 will build targeted jobs.

Run on your own IP first (no proxy) — that's the clean A/B vs. the Selenium run.
If auto-detect can't find a browser, set `BROWSER_PATH` to Chrome or Edge
(`C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe`).

## Step 2: the pull fleet
## The pull fleet

`worker.py` holds one warm nodriver session and loops: poll the .NET C2 for a job
(a skin+wear search), scrape that search's sell-orders via in-page fetch, and post
`csmoney_worker.py` holds one warm nodriver session and loops: poll the .NET C2 for a
job (a skin+wear search), scrape that search's sell-orders via in-page fetch, and post
the items back. The C2 (`BlueLaminate.C2`) picks the stalest skin+wear from the
catalogue, and on result persists to `cs_money_listings` + `price_history`
(`Source = "csmoney"`), stamping `SkinCondition.ListingsSweptAt`.
(`Source = "csmoney"`), stamping that band's per-site checkpoint (the `csmoney`
row in `skin_condition_sweeps`). The checkpoint is per-site, so a band CSFloat
already swept is still due for a cs.money sweep.
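
The per-site checkpoint can be pictured as a lookup keyed by (band, site). The snippet below is a minimal sketch of that selection idea only, assuming a never-swept band counts as the stalest. `pick_stalest`, the in-memory `sweeps` dict, and the sample timestamps are hypothetical stand-ins, not the C2's actual query against `skin_condition_sweeps`.

```python
from datetime import datetime, timezone

# Sentinel for "never swept on this site": older than any real checkpoint.
NEVER = datetime.min.replace(tzinfo=timezone.utc)


def pick_stalest(bands, sweeps, site):
    """Return the band whose checkpoint for `site` is oldest (or missing)."""
    return min(bands, key=lambda band: sweeps.get((band, site), NEVER))


# Made-up sample data: one checkpoint per (band, site) pair.
bands = ["ak-47 redline ft", "m4a4 cyber security ft"]
sweeps = {
    ("ak-47 redline ft", "csfloat"): datetime(2024, 1, 2, tzinfo=timezone.utc),
    ("m4a4 cyber security ft", "csfloat"): datetime(2024, 1, 1, tzinfo=timezone.utc),
    ("m4a4 cyber security ft", "csmoney"): datetime(2024, 1, 3, tzinfo=timezone.utc),
}

# The AK band has a csfloat checkpoint but no csmoney one, so it is still
# due (and stalest) for cs.money even though CSFloat already swept it.
print(pick_stalest(bands, sweeps, "csmoney"))
print(pick_stalest(bands, sweeps, "csfloat"))
```

Keying the checkpoint on the pair rather than on the band alone is what lets each market keep its own sweep schedule.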

Run the C2 (needs Postgres migrated), then the worker:

@@ -65,8 +45,64 @@ dotnet run --project BlueLaminate\BlueLaminate.C2 # serves http://local
# terminal 2 — the worker
cd worker; .venv\Scripts\Activate.ps1
$env:WORKER_TOKEN="dev-worker-token" # must match the C2's WorkerToken
python worker.py
python csmoney_worker.py
```

The worker warms the session (you clear Cloudflare once), then runs continuously.
Scale out by starting more workers (each with its own `PROXY`).

## Layout

Both market scripts are thin: each subclasses `blworker.Worker` and fills in only its
own scrape + cookie-consent steps. Everything shared lives in the `blworker/` package:

| file | responsibility |
| --- | --- |
| `blworker/config.py` | `Settings` — every env knob, parsed once |
| `blworker/log.py` | stdout logging, human or `LOG_JSON=1` (for Loki) |
| `blworker/proxy.py` | IPRoyal forwarder + session/password helpers |
| `blworker/c2.py` | `C2Client` — claim a job, post a result |
| `blworker/runtime.py` | `Worker` base: proxy/browser bring-up, the poll→scrape→post loop, Cloudflare IP rotation, graceful shutdown |
| `csmoney_worker.py` / `skinland_worker.py` | the per-market scrape strategies |

To add a market: subclass `Worker`, set `name`/`jobs_path`/`default_market_url`, implement
`scrape_job` + `describe_job` (+ `dismiss_consent` if it has a banner), and call
`run(YourWorker)`.

## skin.land worker

`skinland_worker.py` is the same pull model for **skin.land** (also Cloudflare-walled). It
shares all the proxy/Cloudflare/C2 plumbing with the cs.money worker via `blworker`; only
the scrape differs. The C2 hands out jobs from its **`/skinland/jobs`** group (the
`skinland` rows in `skin_condition_sweeps`, so a band cs.money/CSFloat already swept is
still due here) and on result persists to `skin_land_listings` + `price_history`
(`Source = "skinland"`).

How it scrapes (learned during discovery):

- A job's target is the market **page URL**, e.g.
  `https://skin.land/market/csgo/ak-47-redline-field-tested/`. The slug is just
  `{weapon}-{skin}-{wear}` kebab-cased — the C2 builds it from the catalogue, no lookup.
- skin.land is a Nuxt SSR app. The page embeds an internal numeric `skin_id`; the worker
  resolves it once from the `__NUXT__` payload (the skin object whose `url` == the slug),
  caches it per slug, then pages the clean JSON API
  `GET https://app.skin.land/api/v2/obtained-skins?skin_id={id}&page={n}` (a Laravel
  paginator `{data:[…offers], meta:{current_page,last_page,…}}`), walking to `last_page`.
- Each offer carries a full-precision `item_float`, `final_withdrawal_price`, and the steam
  `item_link`. skin.land exposes **no paint seed**, so listings aren't fingerprinted to a
  `SkinInstance` (no cross-market roll-up / dupe detection here). StatTrak and Souvenir are
  separate pages (`stattrak-`/`souvenir-` slugs); v1 sweeps the base page per skin+wear.
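
The slug rule and the paginator walk from the list above can be sketched in a few lines of Python. This is an illustration only, not the worker's or the C2's actual code: `market_slug`, `walk_pages`, and the `fetch_page` callback are hypothetical names, the slug normalization (lowercase, runs of non-alphanumerics collapsed to `-`) is an assumption that happens to reproduce the example URL, and `fake_fetch` stands in for the real in-page fetch.

```python
import re
from typing import Callable


def market_slug(weapon: str, skin: str, wear: str) -> str:
    """Kebab-case '{weapon}-{skin}-{wear}' (assumed normalization rule)."""
    raw = f"{weapon} {skin} {wear}".lower()
    return re.sub(r"[^a-z0-9]+", "-", raw).strip("-")


def walk_pages(fetch_page: Callable[[int], dict]) -> list:
    """Collect every offer from a Laravel-style paginator, page 1 to meta.last_page."""
    offers = []
    page = 1
    while True:
        payload = fetch_page(page)
        offers.extend(payload["data"])
        if page >= payload["meta"]["last_page"]:
            return offers
        page += 1


def fake_fetch(page: int) -> dict:
    """Stand-in for the real fetch: three fake pages of two offers each."""
    data = [{"item_float": page + i / 10} for i in range(2)]
    return {"data": data, "meta": {"current_page": page, "last_page": 3}}


slug = market_slug("AK-47", "Redline", "Field-Tested")
offers = walk_pages(fake_fetch)
print(slug, len(offers))
```

Reading `last_page` from each response, rather than fixing it up front, keeps the walk correct if listings are added or removed while the sweep is in progress.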

Run it alongside (or instead of) the cs.money worker — it points at the same C2:

```powershell
cd worker; .venv\Scripts\Activate.ps1
$env:WORKER_TOKEN="dev-worker-token"
python skinland_worker.py
```

Under Docker it's the `skinland-worker` service (same image, `WORKER_SCRIPT=skinland_worker.py`):

```powershell
docker compose up --build --scale skinland-worker=5
```

20
worker/blworker/__init__.py
Normal file
@@ -0,0 +1,20 @@
"""Shared scaffolding for the BlueLaminate market scrape workers.

A market worker (cs.money, skin.land, …) subclasses `Worker`, fills in its scrape +
consent steps, and calls `run(MyWorker)`. Everything else — config, logging, the IPRoyal
proxy/forwarder, the C2 client, the poll/scrape/post loop, IP rotation, graceful
shutdown — lives here so it's written once.
"""

from .config import Settings
from .runtime import ScrapeResult, Worker, click, looks_like_challenge, page_fetch, run

__all__ = [
    "Settings",
    "ScrapeResult",
    "Worker",
    "click",
    "looks_like_challenge",
    "page_fetch",
    "run",
]
57
worker/blworker/c2.py
Normal file
@@ -0,0 +1,57 @@
"""HTTP client for the .NET C2's job endpoints.

Stdlib urllib so the blocking calls run off the asyncio loop via to_thread (the event
loop belongs to the browser). Each worker points at one job route group — "/jobs" for
cs.money, "/skinland/jobs" for skin.land — set once at construction.
"""

import asyncio
import json
import logging
import urllib.error
import urllib.request

log = logging.getLogger("c2")


class C2Client:
    def __init__(self, base_url: str, token: str, jobs_path: str):
        self._base = base_url.rstrip("/")
        self._token = token
        self._jobs = jobs_path.strip("/")

    def _get_job_sync(self):
        req = urllib.request.Request(
            f"{self._base}/{self._jobs}/next", headers={"X-Worker-Token": self._token})
        try:
            with urllib.request.urlopen(req, timeout=15) as r:
                if r.status == 204:
                    return None
                return json.loads(r.read() or b"null")
        except urllib.error.HTTPError as e:
            log.warning("/%s/next -> HTTP %s", self._jobs, e.code)
            return None
        except urllib.error.URLError as e:
            log.warning("C2 unreachable: %s", e)
            return None

    def _post_result_sync(self, job_id: str, payload: dict):
        data = json.dumps(payload).encode()
        req = urllib.request.Request(
            f"{self._base}/{self._jobs}/{job_id}/result", data=data, method="POST",
            headers={"X-Worker-Token": self._token, "Content-Type": "application/json"})
        try:
            with urllib.request.urlopen(req, timeout=60) as r:
                return json.loads(r.read() or b"null")
        except urllib.error.HTTPError as e:
            log.warning("result -> HTTP %s: %r", e.code, e.read()[:200])
            return None
        except urllib.error.URLError as e:
            log.warning("C2 unreachable posting result: %s", e)
            return None

    async def get_job(self):
        return await asyncio.to_thread(self._get_job_sync)

    async def post_result(self, job_id, payload):
        return await asyncio.to_thread(self._post_result_sync, job_id, payload)
81
worker/blworker/config.py
Normal file
@@ -0,0 +1,81 @@
"""Worker configuration, parsed once from the environment.

All env knobs the workers honor live here so there's a single source of truth (the
two market workers used to each re-parse the same ~15 vars). Frozen dataclass — read
it, don't mutate it.
"""

import os
from dataclasses import dataclass


def _int(name: str, default: int) -> int:
    return int(os.environ.get(name, str(default)))


def _float(name: str, default: float) -> float:
    return float(os.environ.get(name, str(default)))


def _flag(name: str) -> bool:
    return os.environ.get(name) == "1"


@dataclass(frozen=True)
class Settings:
    # C2
    c2_url: str
    token: str
    # Session / pacing
    market_url: str  # "" => use the worker's own default page
    solve_seconds: int
    delay: float
    jitter: float
    idle_seconds: int
    # Browser
    browser_path: str | None
    load_images: bool
    chrome_no_sandbox: bool
    # Proxy (auth-free fallback)
    proxy: str | None
    # IPRoyal residential gateway
    iproyal_host: str
    iproyal_port: int
    iproyal_username: str | None
    iproyal_password: str | None
    iproyal_country: str
    iproyal_lifetime_min: int
    # Logging
    log_level: str
    log_json: bool

    @property
    def use_iproyal(self) -> bool:
        """IPRoyal takes priority over a plain PROXY when its creds are set."""
        return bool(self.iproyal_username and self.iproyal_password)

    @classmethod
    def from_env(cls) -> "Settings":
        return cls(
            c2_url=os.environ.get("C2_URL", "http://localhost:5080").rstrip("/"),
            token=os.environ.get("WORKER_TOKEN", "dev-worker-token"),
            market_url=os.environ.get("MARKET_URL", ""),
            solve_seconds=_int("SOLVE_SECONDS", 30),
            delay=_float("DELAY", 2.0),
            jitter=_float("JITTER", 1.5),
            idle_seconds=_int("IDLE_SECONDS", 10),
            browser_path=os.environ.get("BROWSER_PATH") or None,
            # Residential proxy is metered per GB; Cloudflare gates on JS, not images, and
            # the market APIs are pure JSON — so block images unless explicitly debugging.
            load_images=_flag("LOAD_IMAGES"),
            chrome_no_sandbox=_flag("CHROME_NO_SANDBOX"),
            proxy=os.environ.get("PROXY") or None,
            iproyal_host=os.environ.get("IPROYAL_HOST", "geo.iproyal.com"),
            iproyal_port=_int("IPROYAL_PORT", 12321),
            iproyal_username=os.environ.get("IPROYAL_USERNAME") or None,
            iproyal_password=os.environ.get("IPROYAL_PASSWORD") or None,
            iproyal_country=os.environ.get("IPROYAL_COUNTRY", "us").strip().lower(),
            iproyal_lifetime_min=_int("IPROYAL_LIFETIME_MIN", 60),
            log_level=os.environ.get("LOG_LEVEL", "INFO").upper(),
            log_json=_flag("LOG_JSON"),
        )
47
worker/blworker/log.py
Normal file
@@ -0,0 +1,47 @@
"""Stdlib logging setup — one stream handler on stdout, human or JSON.

Workers used to print() everything; that gives no levels, no timestamps, and nothing
Loki can parse. Default is a compact human format for local runs; set LOG_JSON=1 in the
container so Grafana Alloy -> Loki gets structured fields (ts, level, logger, msg) plus
any `extra=` keys a call site attaches.
"""

import json
import logging
import sys

# logging.LogRecord built-ins we don't want to echo into a JSON line as "extra" fields.
_RESERVED = set(
    logging.makeLogRecord({}).__dict__
) | {"message", "asctime", "taskName"}


class _JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        for key, value in record.__dict__.items():
            if key not in _RESERVED and not key.startswith("_"):
                payload[key] = value
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def configure(level: str = "INFO", json_logs: bool = False) -> None:
    """Install a single stdout handler on the root logger (idempotent)."""
    handler = logging.StreamHandler(sys.stdout)
    if json_logs:
        handler.setFormatter(_JsonFormatter())
    else:
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)-5s %(name)s | %(message)s", "%H:%M:%S")
        )
    root = logging.getLogger()
    root.handlers.clear()
    root.addHandler(handler)
    root.setLevel(level)
154
worker/blworker/proxy.py
Normal file
@@ -0,0 +1,154 @@
"""IPRoyal residential proxy plumbing.

The in-process forwarder + the password/session helpers — identical across every market
worker, so they live here. HTTPS market traffic flows through the CONNECT tunnel, so the
forwarder only ever relays ciphertext. Ported from the .NET LocalForwardingProxy /
IpRoyalProxyProvider.
"""

import asyncio
import base64
import logging
import uuid

log = logging.getLogger("proxy")


def new_session_id() -> str:
    """Short, opaque, URL-safe token. IPRoyal pins one residential exit IP per distinct
    session value, so a fresh id == a fresh IP."""
    return uuid.uuid4().hex[:10]


def iproyal_password(password: str, country: str, lifetime_min: int, session_id: str) -> str:
    """Bake the targeting/session knobs onto the account password, IPRoyal-style:
    "<pass>_country-us_session-<id>_lifetime-60m". Country is optional."""
    pw = password
    if country:
        pw += f"_country-{country}"
    pw += f"_session-{session_id}_lifetime-{lifetime_min}m"
    return pw


class LocalForwardingProxy:
    """In-process HTTP proxy on 127.0.0.1 that chains every connection to the IPRoyal
    gateway, injecting the Proxy-Authorization header itself. Chromium ignores creds in
    --proxy-server and the in-browser ways to answer the gateway's 407 (a CDP auth
    handler, or a disabled MV2 extension) are Cloudflare tells — so we terminate the
    browser->proxy hop locally and add auth here, leaving Chrome to talk to an auth-free
    endpoint at zero CDP. HTTPS (all market traffic) flows through the CONNECT tunnel, so
    this proxy only relays ciphertext and never sees plaintext. The active session token
    can be swapped live (set_password) to move to a fresh exit IP without restarting the
    browser. (New tunnels pick up the new IP; any still-open keep-alive tunnel stays on
    the old one until it closes.)"""

    def __init__(self, host: str, port: int, username: str, password: str):
        self._host = host
        self._port = port
        self._username = username
        self._password = password
        self._server: asyncio.AbstractServer | None = None
        self.endpoint = ""

    def set_password(self, password: str) -> None:
        self._password = password

    def _auth_header(self) -> str:
        token = base64.b64encode(f"{self._username}:{self._password}".encode()).decode()
        return f"Proxy-Authorization: Basic {token}\r\n"

    async def start(self) -> "LocalForwardingProxy":
        self._server = await asyncio.start_server(self._handle, "127.0.0.1", 0)
        port = self._server.sockets[0].getsockname()[1]
        self.endpoint = f"127.0.0.1:{port}"
        return self

    async def stop(self) -> None:
        if self._server is not None:
            self._server.close()
            try:
                await self._server.wait_closed()
            except Exception:
                pass

    @staticmethod
    async def _read_header(reader: asyncio.StreamReader) -> str | None:
        """Read up to the end of the HTTP header block (CRLFCRLF). None on EOF/overflow."""
        try:
            data = await reader.readuntil(b"\r\n\r\n")
        except (asyncio.IncompleteReadError, asyncio.LimitOverrunError):
            return None
        return data.decode("latin-1")

    async def _handle(self, client_reader: asyncio.StreamReader, client_writer: asyncio.StreamWriter) -> None:
        up_writer: asyncio.StreamWriter | None = None
        try:
            header = await self._read_header(client_reader)
            if not header:
                return
            parts = header.split("\r\n", 1)[0].split(" ")
            if len(parts) < 2:
                return
            method, target = parts[0], parts[1]

            up_reader, up_writer = await asyncio.open_connection(self._host, self._port)
            if method.upper() == "CONNECT":
                # HTTPS: open an authenticated tunnel upstream, then relay raw bytes.
                up_writer.write(
                    f"CONNECT {target} HTTP/1.1\r\nHost: {target}\r\n{self._auth_header()}\r\n".encode())
                await up_writer.drain()
                up_header = await self._read_header(up_reader)
                status = up_header.split(" ", 2) if up_header else []
                if len(status) < 2 or status[1] != "200":
                    line = (up_header or "no response").split("\r\n", 1)[0]
                    log.warning("upstream refused CONNECT %s: %s", target, line)
                    client_writer.write(b"HTTP/1.1 502 Bad Gateway\r\nConnection: close\r\n\r\n")
                    await client_writer.drain()
                    return
                client_writer.write(b"HTTP/1.1 200 Connection established\r\n\r\n")
                await client_writer.drain()
            else:
                # Plain HTTP: re-inject the request upstream with auth, then relay.
                idx = header.index("\r\n") + 2
                up_writer.write((header[:idx] + self._auth_header() + header[idx:]).encode())
                await up_writer.drain()

            await self._relay(client_reader, client_writer, up_reader, up_writer)
        except Exception:
            pass  # one bad tunnel must never take down the listener
        finally:
            for w in (client_writer, up_writer):
                if w is not None:
                    try:
                        w.close()
                    except Exception:
                        pass

    @staticmethod
    async def _relay(
            client_reader: asyncio.StreamReader, client_writer: asyncio.StreamWriter,
            up_reader: asyncio.StreamReader, up_writer: asyncio.StreamWriter) -> None:
        # Pipe both directions, but tear the whole tunnel down as soon as EITHER side
        # closes (mirrors the .NET WhenAny). Waiting for both — as a plain gather does —
        # leaks a task holding two sockets on every half-closed connection, which piles
        # up fast across a long multi-worker run. Closing both writers when the first pipe
        # finishes unblocks the other's pending read so both tasks settle.
        async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
            try:
                while data := await reader.read(65536):
                    writer.write(data)
                    await writer.drain()
            except Exception:
                pass

        a = asyncio.create_task(pipe(client_reader, up_writer))
        b = asyncio.create_task(pipe(up_reader, client_writer))
        try:
            await asyncio.wait({a, b}, return_when=asyncio.FIRST_COMPLETED)
        finally:
            for w in (client_writer, up_writer):
                try:
                    w.close()
                except Exception:
                    pass
            await asyncio.gather(a, b, return_exceptions=True)
235
worker/blworker/runtime.py
Normal file
@@ -0,0 +1,235 @@
|
||||
"""The shared worker runtime — everything that's identical across market workers.
|
||||
|
||||
`Worker` is a template-method base: it owns the proxy/browser bring-up, the poll ->
|
||||
scrape -> post loop, Cloudflare-driven IP rotation, result logging, and graceful
|
||||
shutdown. A market worker subclasses it and fills in only what differs — how to dismiss
|
||||
the consent banner, how to scrape one job, and how to describe a job in the log. The two
|
||||
~300-line workers used to copy this whole loop verbatim.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import json
|
||||
import logging
|
||||
import random
|
||||
import signal
|
||||
from abc import ABC, abstractmethod
|
||||
from dataclasses import dataclass
|
||||
|
||||
import nodriver as uc
|
||||
|
||||
from .c2 import C2Client
|
||||
from .config import Settings
|
||||
from .proxy import LocalForwardingProxy, iproyal_password, new_session_id
|
||||
|
||||
|
||||
@dataclass
|
||||
class ScrapeResult:
|
||||
"""What a single job scrape yields. `wire_bytes` is the metered (compressed) cost."""
|
||||
items: list
|
||||
pages: int
|
||||
reason: str
|
||||
wire_bytes: int = 0
|
||||
|
||||
|
||||
def looks_like_challenge(body: str) -> bool:
|
||||
"""True for an actual Cloudflare interstitial (or an empty body). Keyed on CF markers,
|
||||
NOT a leading '<' — a real market page IS html, so a startswith('<') check would flag
|
||||
every good page fetch as a challenge."""
|
||||
b = body or ""
|
||||
return not b.strip() or "Just a moment" in b or "challenge-platform" in b
|
||||
|
||||
|
||||
async def page_fetch(page, url: str, accept: str = "application/json") -> tuple[int, str, int]:
|
||||
"""Fetch in-page from the warm (Cloudflare-cleared) session and read back the Resource
|
||||
Timing transferSize — the actual compressed bytes the metered proxy bills (or -1 when
|
||||
cross-origin timing isn't exposed). Returns (status, body, wire_bytes). Use
|
||||
accept='text/html' for an SSR page payload, the default JSON for an API."""
|
||||
expr = (
|
||||
f"fetch({url!r}, {{credentials:'include', headers:{{'accept': {accept!r}}}}})"
|
||||
f".then(async r => {{"
|
||||
f" const body = await r.text();"
|
||||
f" const e = performance.getEntriesByName({url!r}).slice(-1)[0];"
|
||||
f" return JSON.stringify({{status: r.status, body: body, wire: e ? e.transferSize : -1}});"
|
||||
f"}}).catch(e => JSON.stringify({{status: -1, body: String(e), wire: -1}}))"
|
||||
)
|
||||
raw = await page.evaluate(expr, await_promise=True)
|
||||
if not isinstance(raw, str):
|
||||
return (-1, "", -1)
|
||||
try:
|
||||
obj = json.loads(raw)
|
||||
return (int(obj.get("status", -1)), obj.get("body", ""), int(obj.get("wire", -1)))
|
||||
except (json.JSONDecodeError, ValueError, TypeError):
|
||||
return (-1, raw, -1)
|
||||
|
||||
|
||||
async def click(page, text: str, timeout: int = 3) -> bool:
|
||||
"""Best-match click on visible text; swallow the not-found/timeout case."""
|
||||
try:
|
||||
el = await page.find(text, best_match=True, timeout=timeout)
|
||||
if el:
|
||||
await el.click()
|
||||
return True
|
||||
except Exception:
|
||||
pass
|
||||
return False
|
||||
|
||||
|
||||
class Worker(ABC):
|
||||
# Per-market constants, set by the subclass.
|
||||
name: str = "worker"
|
||||
jobs_path: str = "/jobs"
|
||||
default_market_url: str = ""
|
||||
|
||||
def __init__(self, settings: Settings):
|
||||
self.settings = settings
|
||||
self.market_url = settings.market_url or self.default_market_url
|
||||
self.c2 = C2Client(settings.c2_url, settings.token, self.jobs_path)
|
||||
self.log = logging.getLogger(self.name)
|
||||
self._forwarder: LocalForwardingProxy | None = None
|
||||
self._session_id: str | None = None
|
||||
self._stop = asyncio.Event()
|
||||
|
||||
# --- hooks a market worker overrides ------------------------------------------
|
||||
|
||||
@abstractmethod
|
||||
async def scrape_job(self, page, job) -> ScrapeResult:
|
||||
"""Scrape ALL listings for one job and return them."""
|
||||
|
||||
@abstractmethod
|
||||
def describe_job(self, job) -> str:
|
||||
"""One-line job description for the log (e.g. the search term or slug)."""
|
||||
|
||||
async def dismiss_consent(self, page) -> str | None:
|
||||
"""Dismiss the cookie banner privacy-first; return a note, or None if absent.
|
||||
Default: nothing to do. Markets with a banner override this."""
|
||||
return None
|
||||
|
||||
# --- shared machinery ---------------------------------------------------------
|
||||
|
||||
def _iproyal_password(self, session_id: str) -> str:
|
||||
s = self.settings
|
||||
return iproyal_password(s.iproyal_password, s.iproyal_country, s.iproyal_lifetime_min, session_id)
|
||||
|
||||
async def _pace(self, page) -> None:
|
||||
await page.sleep(self.settings.delay + random.uniform(0, self.settings.jitter))
|
||||
|
||||
async def warm(self, page) -> None:
|
||||
"""Open the market and clear Cloudflare so the session holds cf_clearance."""
|
||||
s = self.settings
|
||||
self.log.info("warming session at %s (clear Cloudflare; %ds)", self.market_url, s.solve_seconds)
|
||||
await page.get(self.market_url)
|
||||
await page.sleep(s.solve_seconds)
|
||||
note = await self.dismiss_consent(page)
|
||||
self.log.info("consent: %s", note or "left up")
|
||||
|
||||
async def _setup_proxy(self) -> tuple[str | None, str]:
|
||||
"""IPRoyal (auth'd, per-worker sticky IP) takes priority; else a plain auth-free
|
||||
PROXY; else this host's own IP. Returns (proxy_endpoint, human_label)."""
|
||||
s = self.settings
|
||||
if s.use_iproyal:
|
||||
self._session_id = new_session_id()
|
||||
self._forwarder = await LocalForwardingProxy(
|
||||
s.iproyal_host, s.iproyal_port, s.iproyal_username,
|
||||
self._iproyal_password(self._session_id)).start()
|
||||
label = f"iproyal[{s.iproyal_country or 'any'}] session {self._session_id} via {self._forwarder.endpoint}"
|
||||
return self._forwarder.endpoint, label
|
||||
return s.proxy, (s.proxy or "own IP")
|
||||
|
||||
def _browser_args(self, proxy: str | None) -> list[str]:
|
||||
s = self.settings
|
||||
args = [f"--proxy-server={proxy}"] if proxy else []
|
||||
if not s.load_images:
|
||||
# Disable image loading at the engine level — the dominant bandwidth cost on
|
||||
# an image-heavy market, and unneeded for CF clearance or the JSON API.
|
||||
args.append("--blink-settings=imagesEnabled=false")
|
||||
if s.chrome_no_sandbox:
|
||||
# Required when running Chromium as root in a container.
|
||||
args += ["--no-sandbox", "--disable-dev-shm-usage"]
|
||||
return args
|
||||
|
||||
async def _on_challenge(self, page) -> None:
|
||||
"""The exit IP is likely flagged. On IPRoyal, rotate to a fresh sticky session
|
||||
(new IP) before re-warming; otherwise just re-solve in place."""
|
||||
if self._forwarder is not None:
|
||||
self._session_id = new_session_id()
|
||||
self._forwarder.set_password(self._iproyal_password(self._session_id))
|
||||
self.log.warning("challenged; rotating exit IP -> session %s, re-warming", self._session_id)
|
||||
else:
|
||||
self.log.warning("challenged; re-warming session")
|
||||
await self.warm(page)
|
||||
|
||||
def _log_result(self, res: ScrapeResult, posted: dict | None, total_wire: int) -> None:
|
||||
if posted:
|
||||
summary = (f"matched {posted.get('matched')}, new {posted.get('inserted')}, "
|
||||
f"upd {posted.get('updated')}, removed {posted.get('removed')}")
|
||||
else:
|
||||
summary = "post failed"
|
||||
self.log.info("scraped %d items (%dp, %s, %.0fKB wire) -> %s [lifetime %.1fMB]",
|
||||
len(res.items), res.pages, res.reason, res.wire_bytes / 1024,
|
||||
summary, total_wire / 1_048_576)
|
||||
|
||||
def _install_signal_handlers(self) -> None:
|
||||
"""Stop the loop on SIGINT/SIGTERM so `docker stop` shuts down cleanly. Not
|
||||
supported on Windows (ProactorEventLoop) — there Ctrl-C still raises
|
||||
KeyboardInterrupt, which the run loop's finally handles just as well."""
|
||||
try:
|
||||
loop = asyncio.get_running_loop()
|
||||
for sig in (signal.SIGINT, signal.SIGTERM):
|
||||
loop.add_signal_handler(sig, self._stop.set)
|
||||
except (NotImplementedError, AttributeError):
|
||||
pass
|
||||
|
||||
async def _idle(self) -> None:
|
||||
"""Sleep when the C2 has no work, but wake immediately on shutdown."""
|
||||
try:
|
||||
await asyncio.wait_for(self._stop.wait(), timeout=self.settings.idle_seconds)
|
||||
except asyncio.TimeoutError:
|
||||
pass
|
||||
|
||||
    async def run(self) -> None:
        self._install_signal_handlers()
        s = self.settings
        proxy, proxy_label = await self._setup_proxy()
        self.log.info("starting (C2=%s, proxy=%s, images=%s)",
                      s.c2_url, proxy_label, "on" if s.load_images else "off")
        browser = await uc.start(
            headless=False, browser_executable_path=s.browser_path,
            browser_args=self._browser_args(proxy))
        try:
            page = await browser.get("about:blank")
            await self.warm(page)

            total_wire = 0  # metered (compressed) bytes pulled, lifetime
            while not self._stop.is_set():
                job = await self.c2.get_job()
                if not job:
                    await self._idle()
                    continue

                self.log.info("job %s — %s", job["jobId"][:8], self.describe_job(job))
                res = await self.scrape_job(page, job)
                total_wire += res.wire_bytes

                if res.reason == "challenged":
                    await self._on_challenge(page)

                posted = await self.c2.post_result(job["jobId"], {
                    "items": res.items, "pages": res.pages, "stoppedReason": res.reason})
                self._log_result(res, posted, total_wire)

                await self._pace(page)
        finally:
            self.log.info("shutting down")
            browser.stop()
            if self._forwarder is not None:
                await self._forwarder.stop()


def run(worker_cls: type[Worker]) -> None:
    """Boot a worker from the environment: parse config, set up logging, run the loop on
    nodriver's event loop. The thin market scripts call this and nothing else."""
    from . import log as log_setup

    settings = Settings.from_env()
    log_setup.configure(settings.log_level, settings.log_json)
    uc.loop().run_until_complete(worker_cls(settings).run())
|
||||
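Review note on `_idle` above: the interruptible sleep can be exercised in isolation. The following is a minimal sketch, assuming only what the method shows (an `asyncio.Event` used as the stop flag and `asyncio.wait_for` as the timer); the free function `idle`, the `demo` coroutine, and the timings are hypothetical and not part of the commit.

```python
import asyncio


async def idle(stop: asyncio.Event, seconds: float) -> bool:
    """Sleep up to `seconds`; return True if `stop` was set first, False on timeout."""
    try:
        await asyncio.wait_for(stop.wait(), timeout=seconds)
        return True
    except asyncio.TimeoutError:
        return False


async def demo() -> tuple:
    stop = asyncio.Event()
    # Nothing sets the event, so this call sleeps for the full (short) timeout.
    first = await idle(stop, 0.01)
    # Schedule a "shutdown" long before the 5 s timeout would expire.
    asyncio.get_running_loop().call_later(0.01, stop.set)
    second = await idle(stop, 5.0)
    return first, second
```

Running `asyncio.run(demo())` should return quickly even though the second call asks for a 5 second sleep, which is the property that lets `docker stop` end an idle worker promptly.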
129
worker/csmoney_worker.py
Normal file
@@ -0,0 +1,129 @@
"""cs.money scrape worker (pull model).

A thin strategy over blworker.Worker: it supplies only the cs.money-specific bits, namely
the consent banner steps and how to scrape one skin+wear's sell-orders. The warm session,
the poll/scrape/post loop, the IPRoyal proxy and IP rotation, logging and shutdown all live
in the shared runtime. Env knobs are documented in worker/README.md.

cs.money is an Astro SSR app: the free-text market search filters server-side and the
resulting listings are embedded in the page as a __page-params JSON blob. The
/2.0/market/sell-orders API rejects a `search` param (HTTP 400), so we fetch the PAGE for
a search and read the embedded items, which have the same item shape as the API.

A page returns at most 60 and offset is ignored, so we paginate with a FORWARD CURSOR on
float: cs.money honors `order=asc&sort=float` + `minFloat`, and float is full-precision and
effectively unique per item. We grab the 60 lowest-float items at/above `lo`, advance `lo`
to the highest float returned, and repeat until a page is under the cap. (The old
minPrice/maxPrice bisection silently truncated cheap skins: >60 listings can share a
sub-$0.02 reference band, which no price window can split. Floats almost never tie, so the
cursor makes progress in practice.)

    cd worker
    .venv\\Scripts\\Activate.ps1
    pip install -r requirements.txt
    python csmoney_worker.py
"""

import json
import re
import urllib.parse

from blworker import ScrapeResult, Worker, click, page_fetch, run

PAGE = ("https://cs.money/market/buy/?search={search}"
        "&order=asc&sort=float&minFloat={lo:.12f}&maxFloat=1")
PAGE_CAP = 60  # items per SSR page
PAGE_PARAMS_RE = re.compile(
    r'<script\b[^>]*id="__page-params"[^>]*>(.*?)</script>', re.S)


def extract_items(html: str) -> list:
    """Pull inventory.items out of the page's __page-params JSON blob."""
    m = PAGE_PARAMS_RE.search(html)
    if not m:
        return []
    try:
        return json.loads(m.group(1)).get("inventory", {}).get("items", []) or []
    except json.JSONDecodeError:
        return []
|
||||
|
||||
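Review note on `extract_items` above: as written it would raise `AttributeError` if the blob ever carried `"inventory": null` or a non-object top level, since only `JSONDecodeError` is caught. Below is a hedged sketch of a more tolerant variant; the name `extract_items_tolerant` and the `SAMPLE` page (including its `type` attribute) are hypothetical, and whether cs.money ever emits such payloads is not established by this diff.

```python
import json
import re

PAGE_PARAMS_RE = re.compile(
    r'<script\b[^>]*id="__page-params"[^>]*>(.*?)</script>', re.S)


def extract_items_tolerant(html: str) -> list:
    """Like extract_items, but returns [] for null/odd-shaped payloads instead of raising."""
    m = PAGE_PARAMS_RE.search(html or "")
    if not m:
        return []
    try:
        data = json.loads(m.group(1))
    except json.JSONDecodeError:
        return []
    # Guard each level: the blob may not be an object, and inventory may be null.
    inventory = data.get("inventory") if isinstance(data, dict) else None
    items = inventory.get("items") if isinstance(inventory, dict) else None
    return items if isinstance(items, list) else []


# Invented sample page; only the id="__page-params" hook is taken from the worker.
SAMPLE = ('<html><script type="application/json" id="__page-params">'
          '{"inventory": {"items": [{"id": 1, "asset": {"float": 0.07}}]}}'
          '</script></html>')
NULL_INVENTORY = '<script id="__page-params">{"inventory": null}</script>'
```

The happy path behaves the same as the original; the difference only shows on malformed pages, where an empty list lets the scrape loop end as a short page instead of crashing the worker.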
|
||||
class CsMoneyWorker(Worker):
    name = "csmoney"
    jobs_path = "/jobs"
    default_market_url = "https://cs.money/market/buy/"

    def describe_job(self, job) -> str:
        return f"search {job['search']!r}"

    async def dismiss_consent(self, page) -> str | None:
        """Privacy-preserving. The banner only offers 'Accept all' / 'Manage cookies';
        the Reject-all control lives inside the Manage window. So: Manage -> Reject all ->
        Confirm. (The data path reads SSR __page-params regardless, but this keeps the
        session honest and unblocks any future interaction.)"""
        steps = []
        if await click(page, "Manage cookies") or await click(page, "Manage"):
            await page.sleep(1)
        if await click(page, "Reject all"):
            steps.append("reject-all")
        for c in ("Confirm my choice", "Confirm", "Save"):
            if await click(page, c):
                steps.append(f"confirm:{c}")
                break
        return ", ".join(steps) if steps else None

    async def scrape_job(self, page, job) -> ScrapeResult:
        """Scrape ALL listings for one skin+wear via a forward float cursor.

        Grab the 60 lowest-float items at/above `lo`, advance `lo` to the highest float on
        the page, repeat until a page is under the cap. The boundary item is re-fetched
        (minFloat is inclusive) and dropped by the id dedup."""
        search = urllib.parse.quote_plus(job["search"])
        max_fetches = job.get("maxPages", 40)  # safety cap on page fetches per job
        seen: dict = {}
        fetches = 0
        wire = 0
        lo = 0.0
        reason = "completed"

        while fetches < max_fetches:
            _status, body, wbytes = await page_fetch(page, PAGE.format(search=search, lo=lo))
            fetches += 1
            if wbytes > 0:
                wire += wbytes

            if "Just a moment" in body or "challenge-platform" in body:
                return ScrapeResult(list(seen.values()), fetches, "challenged", wire)

            items = extract_items(body)
            floats = []
            for it in items:
                if it.get("id") is not None:
                    seen[it["id"]] = it
                fl = (it.get("asset") or {}).get("float")
                if isinstance(fl, (int, float)):
                    floats.append(fl)

            if len(items) < PAGE_CAP:
                break  # last page: fewer than the cap means we've seen everything

            # Advance the cursor to the highest float on this page. Items at exactly that
            # float are re-fetched next round (minFloat is inclusive) and deduped by id.
            nxt = max(floats) if floats else None
            if nxt is None or nxt <= lo:
                # Cursor can't advance: >60 listings share a single float value, or the
                # items carry no float. Bail loudly rather than spin; a flagged gap beats
                # a silent one (this is the failure the price-window version hid).
                reason = "stuck-float-tie"
                break
            lo = nxt

            await self._pace(page)
        else:
            reason = "fetch-cap"

        return ScrapeResult(list(seen.values()), fetches, reason, wire)


if __name__ == "__main__":
    run(CsMoneyWorker)
|
||||
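Review note on `scrape_job` above: the forward float cursor can be checked offline against an in-memory stand-in for the SSR page. This is a sketch under stated assumptions: `fake_page` is a hypothetical substitute for `page_fetch` plus `extract_items`, items are reduced to `{"id", "float"}`, and the cap is shrunk from 60 to 3 so the traces stay readable.

```python
PAGE_CAP = 3  # tiny cap for the demo; the worker uses 60


def fake_page(listings, lo, cap=PAGE_CAP):
    """Hypothetical server: the `cap` lowest-float listings with float >= lo (inclusive)."""
    hits = sorted((it for it in listings if it["float"] >= lo), key=lambda it: it["float"])
    return hits[:cap]


def crawl(listings, max_fetches=40):
    seen, lo, fetches, reason = {}, 0.0, 0, "completed"
    while fetches < max_fetches:
        items = fake_page(listings, lo)
        fetches += 1
        for it in items:
            seen[it["id"]] = it          # id dedup absorbs the re-fetched boundary item
        if len(items) < PAGE_CAP:
            break                        # short page: nothing left at or above the cursor
        nxt = max(it["float"] for it in items)
        if nxt <= lo:
            reason = "stuck-float-tie"   # a full page of one float value: cursor cannot move
            break
        lo = nxt
    else:
        reason = "fetch-cap"             # loop ran out of fetches without breaking
    return sorted(seen), fetches, reason


distinct = [{"id": i, "float": i / 100} for i in range(1, 8)]   # 7 listings, unique floats
tied = [{"id": i, "float": 0.5} for i in range(1, 5)]           # 4 listings, one float value
```

With seven distinct floats the crawl needs four fetches (each full page re-reads one boundary item) and recovers every id; with more ties than the cap it stops after two fetches and reports `stuck-float-tie` rather than looping. One thing that may be worth a separate look in the worker: `{lo:.12f}` rounds to nearest, so the formatted cursor could in principle land marginally above the true boundary float.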
@@ -1,71 +0,0 @@
"""
Diagnose the cs.money cookie-consent banner so we can dismiss it programmatically.
It's likely a Shadow DOM web component (CookieConsentSystem), which is why
document.querySelectorAll-based clicks miss the real buttons.

Saves:
  captures/_consent.png - screenshot (so we can SEE the banner + button positions)
  captures/_consent.txt - shadow-host tags + every consent-like button found by
                          piercing shadow roots, with center coordinates.

  cd worker; .venv\\Scripts\\Activate.ps1
  python diag_consent.py
"""

import json
import os
import pathlib

import nodriver as uc

URL = os.environ.get("URL", "https://cs.money/market/buy/?search=ak-47+redline")
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
BROWSER_PATH = os.environ.get("BROWSER_PATH")
OUT = pathlib.Path(__file__).parent / "captures"

# Pierce shadow roots to find consent buttons + their viewport-center coords.
DEEP_FIND = r"""
JSON.stringify((()=>{
  const hits=[], hosts=[];
  function walk(root){
    root.querySelectorAll('*').forEach(e=>{
      if(e.shadowRoot){ hosts.push(e.tagName.toLowerCase()); walk(e.shadowRoot); }
      const t=(e.textContent||'').trim();
      if(t.length<40 && /accept all|manage cookies|reject all|confirm my choice|^accept$|^manage$/i.test(t)){
        const r=e.getBoundingClientRect();
        if(r.width>0&&r.height>0)
          hits.push({tag:e.tagName, text:t, x:Math.round(r.x+r.width/2), y:Math.round(r.y+r.height/2)});
      }
    });
  }
  walk(document);
  return {shadowHosts:[...new Set(hosts)], buttons:hits};
})())
"""


async def main():
    OUT.mkdir(exist_ok=True)
    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH)
    try:
        page = await browser.get(URL)
        print(f"Loaded {URL}; waiting {SOLVE_SECONDS}s for Cloudflare...")
        await page.sleep(SOLVE_SECONDS)

        png = str(OUT / "_consent.png")
        await page.save_screenshot(png)
        print(f"screenshot -> {png}")

        raw = await page.evaluate(DEEP_FIND)
        info = json.loads(raw) if isinstance(raw, str) else {"error": repr(raw)}
        (OUT / "_consent.txt").write_text(json.dumps(info, indent=2), encoding="utf-8")
        print("shadow hosts:", info.get("shadowHosts"))
        print("consent buttons found:")
        for b in info.get("buttons", []):
            print(f" {b}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())
|
||||
@@ -1,183 +0,0 @@
"""
Discover how cs.money paginates a filtered search past the initial ~60 SSR items.

Tests two hypotheses against a high-result search (default "ak-47 redline", which has
well over 60 listings):

  A. Does the SSR page honor offset/limit in the URL? Fetch ?search=...&offset=60 and
     ?search=...&limit=120 and compare item ids to page 1. If disjoint/larger, we can
     paginate cheaply by re-fetching the page.
  B. The real client "load more": scroll hard to trigger lazy-load and capture any
     cs.money /2.0/ XHR via Resource Timing; that request carries the structured
     filter params + offset, i.e. a lighter direct-API pagination path.

Findings are printed and saved to captures/_pagination.txt.

  cd worker; .venv\\Scripts\\Activate.ps1
  python discover_pagination.py
  $env:SEARCH="ak-47 redline"; python discover_pagination.py   # override the search
"""

import json
import os
import pathlib
import re

import nodriver as uc
from nodriver import cdp

SEARCH = os.environ.get("SEARCH", "ak-47 redline")
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
BROWSER_PATH = os.environ.get("BROWSER_PATH")
PROXY = os.environ.get("PROXY")

BASE = "https://cs.money/market/buy/"
PAGE_PARAMS_RE = re.compile(r'<script\b[^>]*id="__page-params"[^>]*>(.*?)</script>', re.S)
OUT = pathlib.Path(__file__).parent / "captures"
CONSENT = ["Reject all", "Only necessary", "Reject", "Decline", "Deny"]

# Aggressive scroll: window + every scrollable container (the grid scrolls in a div,
# which is why a plain window.scrollTo didn't trigger lazy-load before).
SCROLL_JS = (
    "window.scrollTo(0, document.body.scrollHeight);"
    "document.querySelectorAll('*').forEach(e=>{"
    " if (e.scrollHeight > e.clientHeight + 80) e.scrollTop = e.scrollHeight;});")


async def js(page, expr):
    raw = await page.evaluate(f"JSON.stringify({expr})")
    try:
        return json.loads(raw) if isinstance(raw, str) else None
    except (json.JSONDecodeError, TypeError):
        return None


async def fetch_text(page, url):
    expr = (f"fetch({url!r},{{credentials:'include'}}).then(async r=>"
            f"JSON.stringify({{status:r.status, body:await r.text()}}))")
    raw = await page.evaluate(expr, await_promise=True)
    try:
        o = json.loads(raw)
        return o.get("status"), o.get("body", "")
    except (json.JSONDecodeError, TypeError):
        return None, ""


def page_item_ids(html):
    m = PAGE_PARAMS_RE.search(html or "")
    if not m:
        return []
    try:
        return [it.get("id") for it in json.loads(m.group(1)).get("inventory", {}).get("items", [])]
    except json.JSONDecodeError:
        return []


async def click_visible(page, pattern):
    """Click the first VISIBLE element whose trimmed text matches `pattern`
    (case-insensitive). nodriver's find() was matching hidden/duplicate nodes; restricting
    to offsetParent!=null + short text hits the real button."""
    expr = ("JSON.stringify((()=>{"
            "const re=new RegExp(" + json.dumps(pattern) + ",'i');"
            "const els=[...document.querySelectorAll('button,a,[role=\"button\"],span,div')];"
            "const b=els.find(e=>e.offsetParent!==null && (e.textContent||'').trim().length<40 "
            "&& re.test((e.textContent||'').trim()));"
            "if(b){b.click();return true}return false})())")
    r = await page.evaluate(expr)
    return isinstance(r, str) and "true" in r


async def banner_present(page):
    r = await page.evaluate(
        "JSON.stringify(/Manage cookies|Accept all/i.test(document.body.innerText||''))")
    return isinstance(r, str) and "true" in r


async def dismiss(page):
    """Privacy-preserving first (Manage -> Reject all -> Confirm); if the banner is
    still up, fall back to Accept all so the page becomes interactive (discovery
    needs scrolling to work)."""
    steps = []
    if await click_visible(page, "manage cookies|^manage$"):
        steps.append("manage")
        await page.sleep(1.2)
    if await click_visible(page, "reject all"):
        steps.append("reject-all")
        await page.sleep(0.4)
    for c in ("confirm my choice", "^confirm$", "^save$"):
        if await click_visible(page, c):
            steps.append("confirm")
            break
    await page.sleep(1)
    if await banner_present(page):
        steps.append("still-up->accept" if await click_visible(page, "accept all|^accept$") else "still-up")
        await page.sleep(0.5)
    steps.append("gone" if not await banner_present(page) else "STILL-PRESENT")
    return ", ".join(steps)


async def main():
    OUT.mkdir(exist_ok=True)
    args = [f"--proxy-server={PROXY}"] if PROXY else []
    args.append("--blink-settings=imagesEnabled=false")
    from urllib.parse import quote_plus
    q = quote_plus(SEARCH)
    findings = []

    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH, browser_args=args)
    try:
        url0 = f"{BASE}?search={q}"
        page = await browser.get(url0)
        print(f"Warming on {url0} ({SOLVE_SECONDS}s for Cloudflare)...")
        await page.sleep(SOLVE_SECONDS)
        print(f"Consent: {await dismiss(page)}")

        # --- A. URL offset/limit on the SSR page ---
        _, h0 = await fetch_text(page, f"{BASE}?search={q}")
        _, h1 = await fetch_text(page, f"{BASE}?search={q}&offset=60")
        _, h2 = await fetch_text(page, f"{BASE}?search={q}&limit=120")
        a, b, c = page_item_ids(h0), page_item_ids(h1), page_item_ids(h2)
        overlap = len(set(a) & set(b))
        findings.append(f"page1 ids={len(a)} offset=60 ids={len(b)} (overlap with page1={overlap}) limit=120 ids={len(c)}")
        findings.append(f" -> offset works? {'YES (disjoint)' if b and overlap == 0 else 'no/ignored'}")
        findings.append(f" -> limit works? {'YES (>60)' if len(c) > 60 else 'no/ignored'}")

        # --- B. Trigger client load-more, capture cs.money /2.0/ XHRs ---
        # Infinite scroll only fires on GRADUAL downward scrolling; jumping to the
        # bottom skips the trigger. So step down in small wheel increments and watch
        # the item count grow.
        before = set(await js(page, "performance.getEntriesByType('resource').map(e=>e.name)") or [])

        async def card_count():
            n = await page.evaluate(
                "JSON.stringify(document.querySelectorAll('[href*=\"/item/\"],[class*=\"item\" i]').length)")
            return n

        print(f" cards before scroll: {await card_count()}")
        for step in range(60):
            try:
                await page.send(cdp.input_.dispatch_mouse_event(
                    type_="mouseWheel", x=720, y=450, delta_x=0, delta_y=500))
            except Exception:
                pass
            await page.sleep(0.7)
            if step % 15 == 14:
                now = [u for u in (await js(page, "performance.getEntriesByType('resource').map(e=>e.name)") or [])
                       if u not in before and "cs.money" in u and "metrics." not in u and "traces." not in u]
                print(f" step {step+1}: cards={await card_count()} new cs.money reqs={len(now)}")
        after = await js(page, "performance.getEntriesByType('resource').map(e=>e.name)") or []
        new_xhrs = [u for u in after if u not in before and "cs.money" in u
                    and "metrics." not in u and "traces." not in u]
        findings.append(f"\nclient requests after scrolling ({len(new_xhrs)} new cs.money):")
        findings.extend(f" {u}" for u in dict.fromkeys(new_xhrs))
        if not new_xhrs:
            findings.append(" (none — grid may not lazy-load via XHR, or scroll didn't reach the trigger)")

        report = "\n".join(findings)
        print("\n=== FINDINGS ===\n" + report)
        (OUT / "_pagination.txt").write_text(f"search: {SEARCH}\n\n{report}\n", encoding="utf-8")
        print(f"\nsaved to {OUT / '_pagination.txt'}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())
|
||||
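Review note on hypothesis A in the script above: the two verdicts are simple set and length checks, so they can be pulled out and tested without a browser. A minimal sketch follows; the helper names `offset_honored` and `limit_honored` are hypothetical, and the rules are the ones the `findings` lines encode.

```python
PAGE_CAP = 60  # SSR page size the script compares against


def offset_honored(page1_ids, offset_ids):
    """Offset counts as working only if the offset page is non-empty AND shares no id with page 1."""
    return bool(offset_ids) and not (set(page1_ids) & set(offset_ids))


def limit_honored(limit_ids, cap=PAGE_CAP):
    """A larger limit counts as working only if it yields more than one page's worth of ids."""
    return len(limit_ids) > cap
```

Note that a disjoint result is suggestive rather than conclusive: if listings churn between the two fetches, pages could differ for reasons unrelated to `offset`, so a positive verdict would probably deserve a second run.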
@@ -1,96 +0,0 @@
"""
Find cs.money's price-filter URL param (the basis for price-bucket pagination).

The market has a Price from/to filter in the sidebar. `search=` works via the URL and
the page SSRs the filtered listings into __page-params, so a price param likely works
the same way. We baseline the cheapest set, then try candidate param names with a high
floor and check whether the returned listings actually shift above it.

  cd worker; .venv\\Scripts\\Activate.ps1
  python discover_price_param.py
"""

import json
import os
import pathlib
import re
from urllib.parse import quote_plus

import nodriver as uc

SEARCH = os.environ.get("SEARCH", "ak-47 redline")
FLOOR = float(os.environ.get("FLOOR", "200"))
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
BROWSER_PATH = os.environ.get("BROWSER_PATH")
BASE = "https://cs.money/market/buy/"
PP = re.compile(r'<script\b[^>]*id="__page-params"[^>]*>(.*?)</script>', re.S)
OUT = pathlib.Path(__file__).parent / "captures"

# Param-name variants for a price floor (and a couple of from/to pairs).
CANDIDATES = [
    "minPrice", "priceFrom", "price_from", "priceMin", "min_price",
    "priceGte", "from", "price_min", "minprice", "price.gte", "pricegte",
]


async def fetch_prices(page, url):
    expr = (f"fetch({url!r},{{credentials:'include'}}).then(async r=>"
            f"JSON.stringify({{status:r.status, body:await r.text()}}))")
    raw = await page.evaluate(expr, await_promise=True)
    try:
        body = json.loads(raw).get("body", "")
    except (json.JSONDecodeError, TypeError):
        return None
    m = PP.search(body or "")
    if not m:
        return None
    try:
        items = json.loads(m.group(1)).get("inventory", {}).get("items", [])
    except json.JSONDecodeError:
        return None
    return [it.get("pricing", {}) for it in items if it.get("pricing")]


async def main():
    OUT.mkdir(exist_ok=True)
    q = quote_plus(SEARCH)
    lines = []
    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH,
                             browser_args=["--blink-settings=imagesEnabled=false"])
    try:
        page = await browser.get(f"{BASE}?search={q}")
        print(f"Warming ({SOLVE_SECONDS}s)...")
        await page.sleep(SOLVE_SECONDS)

        # Test minPrice/maxPrice semantics directly (old cs.money API used these).
        tests = [
            ("baseline", f"{BASE}?search={q}"),
            ("maxPrice=200", f"{BASE}?search={q}&maxPrice=200"),
            ("minPrice=300", f"{BASE}?search={q}&minPrice=300"),
            ("minPrice=300&maxPrice=400", f"{BASE}?search={q}&minPrice=300&maxPrice=400"),
            ("minPrice=500&maxPrice=1000", f"{BASE}?search={q}&minPrice=500&maxPrice=1000"),
        ]

        def rng(pr, field):
            vals = [p.get(field) for p in pr if isinstance(p.get(field), (int, float))]
            return (min(vals), max(vals)) if vals else (None, None)

        for name, url in tests:
            pr = await fetch_prices(page, url)
            if not pr:
                lines.append(f"{name:28} -> no items")
            else:
                d0, d1 = rng(pr, "default")
                c0, c1 = rng(pr, "computed")
                b0, b1 = rng(pr, "basePrice")
                lines.append(f"{name:28} -> n={len(pr)} default[{d0:.2f},{d1:.2f}] "
                             f"computed[{c0:.2f},{c1:.2f}] base[{b0:.2f},{b1:.2f}]")
            print(lines[-1])

        (OUT / "_price_param.txt").write_text(
            f"search={SEARCH} floor={FLOOR}\n\n" + "\n".join(lines), encoding="utf-8")
        print(f"\nsaved to {OUT/'_price_param.txt'}")
    finally:
        browser.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())
|
||||
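Review note on the report line in the script above: `rng` returns `(None, None)` when no item carries the requested pricing field, and formatting `None` with `:.2f` raises `TypeError`, so one missing field would abort the whole run. A hedged sketch of a None-tolerant formatter is below; `fmt_range` and the `[n/a]` placeholder are hypothetical, while `rng` mirrors the helper in the script.

```python
def rng(pricings, field):
    """(min, max) of a numeric pricing field, or (None, None) if no item has it."""
    vals = [p.get(field) for p in pricings if isinstance(p.get(field), (int, float))]
    return (min(vals), max(vals)) if vals else (None, None)


def fmt_range(lo, hi):
    """Render a (lo, hi) pair, tolerating the (None, None) that rng returns for a missing field."""
    if lo is None or hi is None:
        return "[n/a]"
    return f"[{lo:.2f},{hi:.2f}]"


# Invented pricing dicts; only the field names come from the script.
sample = [{"default": 1.5, "computed": 1.4}, {"default": 3, "computed": 2.75}]
```

With this shape, a report line could be assembled as `f"default{fmt_range(*rng(pr, 'default'))}"` and would degrade to `default[n/a]` instead of raising.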
@@ -15,5 +15,6 @@ x11vnc -display "${DISPLAY_NUM}" -forever -shared -nopw -quiet -bg
|
||||
echo "[entrypoint] starting noVNC on :6080 (open http://localhost:6080/vnc.html)"
websockify --web=/usr/share/novnc 6080 localhost:5900 &

echo "[entrypoint] launching worker"
exec python worker.py
WORKER_SCRIPT="${WORKER_SCRIPT:-csmoney_worker.py}"
echo "[entrypoint] launching ${WORKER_SCRIPT}"
exec python "${WORKER_SCRIPT}"
|
||||
|
||||
285
worker/poc.py
@@ -1,285 +0,0 @@
"""
Proof-of-concept / pre-fleet validation for the cs.money scraper.

Proves the things we need before building the C2 + worker fleet:
  1. nodriver clears cs.money's Cloudflare where .NET Selenium couldn't.
  2. a single WARM session can page the sell-orders API deeply without re-challenge.
  3. a free-text market search (e.g. "cyber security ft") can be turned into a
     filtered sell-orders API call: we DISCOVER the real API params by capturing the
     request the page itself fires, instead of guessing.

It opens the market (optionally a search URL) in a real non-headless Chromium, lets
you clear Cloudflare, dismisses the cookie banner (privacy-preserving), captures the
sell-orders request the page makes, then pages that API from inside the cleared page
(same-origin fetch carries cf_clearance), pacing itself and stopping on re-challenge.

  cd worker
  .venv\\Scripts\\Activate.ps1
  pip install -r requirements.txt

  python poc.py                                    # whole-market sweep
  $env:SEARCH="cyber security ft"; python poc.py   # targeted: FT M4A4 Cyber Security

Env knobs (all optional):
  SEARCH          free-text market search; when set, scrape only those results
  MARKET_URL      market page base (default the buy market)
  SOLVE_SECONDS   seconds to wait for you to clear Cloudflare (default 30)
  PAGES           how many offset pages (60 each) to attempt (default 20)
  START_OFFSET    first offset (default 0)
  DELAY / JITTER  base + random seconds between fetches (default 2.0 / 1.5)
  PROXY           host:port for an auth-free proxy (omit to use your own IP)
  BROWSER_PATH    path to Chrome/Edge if auto-detect fails
"""

import json
import os
import pathlib
import random
from urllib.parse import quote_plus, urlsplit, parse_qsl, urlencode, urlunsplit

import nodriver as uc
from nodriver import cdp

SEARCH = os.environ.get("SEARCH")
MARKET_URL = os.environ.get("MARKET_URL", "https://cs.money/market/buy/")
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
PAGES = int(os.environ.get("PAGES", "20"))
START_OFFSET = int(os.environ.get("START_OFFSET", "0"))
DELAY = float(os.environ.get("DELAY", "2.0"))
JITTER = float(os.environ.get("JITTER", "1.5"))
PROXY = os.environ.get("PROXY")
BROWSER_PATH = os.environ.get("BROWSER_PATH")

# Fallback template if we fail to capture the page's own request (offset = {}).
DEFAULT_TEMPLATE = "https://cs.money/2.0/market/sell-orders?limit=60&offset={}"
OUT_DIR = pathlib.Path(__file__).parent / "captures"
CONSENT_LABELS = ["Reject all", "Reject All", "Only necessary", "Necessary only",
                  "Reject", "Decline", "Deny"]

# Filled by the CDP network handler with sell-orders request URLs the page fires.
_seen_urls: list[str] = []


def looks_like_challenge(body: str) -> bool:
    s = (body or "").lstrip()
    return not s or s.startswith("<") or "Just a moment" in body or "challenge-platform" in body


def decimals(v: float) -> int:
    r = repr(float(v))
    return len(r.split(".")[-1]) if "." in r else 0


def template_from(url: str) -> str:
    """Turn a captured sell-orders URL into a template with offset as '{}',
    preserving every other param (the search/filter encoding we want to learn)."""
    parts = urlsplit(url)
    q = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "offset"]
    if not any(k == "limit" for k, _ in q):
        q.append(("limit", "60"))
    base_q = urlencode(q)
    new_q = (base_q + "&" if base_q else "") + "offset={}"
    return urlunsplit((parts.scheme, parts.netloc, parts.path, new_q, ""))
|
||||
|
||||
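Review note on `template_from` above: a quick offline check makes its contract concrete (drop `offset`, keep everything else, default `limit` to 60, append `offset={}` last). The sketch repeats the function so it runs standalone; the captured URL and its `order`/`sort` params are invented for illustration and are not taken from a real cs.money request.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def template_from(url: str) -> str:
    """Same logic as the PoC helper: strip offset, ensure limit, append an offset={} slot."""
    parts = urlsplit(url)
    q = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "offset"]
    if not any(k == "limit" for k, _ in q):
        q.append(("limit", "60"))
    base_q = urlencode(q)
    new_q = (base_q + "&" if base_q else "") + "offset={}"
    return urlunsplit((parts.scheme, parts.netloc, parts.path, new_q, ""))


# Hypothetical captured request (params invented for the demo).
captured = "https://cs.money/2.0/market/sell-orders?limit=60&offset=120&order=asc&sort=price"
template = template_from(captured)
next_page = template.format(180)   # fill the offset slot for the next page
bare = template_from("https://cs.money/2.0/market/sell-orders")
```

Because `urlencode` percent-encodes the preserved values, the only literal braces left in the template are the trailing `offset={}` slot, which is what makes the later `template.format(offset)` call safe.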
|
||||
async def dismiss_consent(page) -> str | None:
    """Best-effort, privacy-preserving: never clicks 'Accept all'."""
    for label in CONSENT_LABELS:
        try:
            el = await page.find(label, best_match=True, timeout=2)
        except Exception:
            el = None
        if el:
            try:
                await el.click()
                return label
            except Exception:
                pass
    return None


async def fetch_json(page, url: str) -> tuple[str, str]:
    expr = (
        f"fetch({url!r}, {{credentials:'include', headers:{{'accept':'application/json'}}}})"
        f".then(async r => JSON.stringify({{status: r.status, body: await r.text()}}))"
    )
    raw = await page.evaluate(expr, await_promise=True)
    if not isinstance(raw, str):
        return ("-1", "")
    try:
        obj = json.loads(raw)
        return (str(obj.get("status", "-1")), obj.get("body", ""))
    except json.JSONDecodeError:
        return ("-1", raw)


async def main():
    OUT_DIR.mkdir(exist_ok=True)
    args = [f"--proxy-server={PROXY}"] if PROXY else []

    target_url = MARKET_URL
    tag = "market"
    if SEARCH:
        sep = "&" if "?" in MARKET_URL else "?"
        target_url = f"{MARKET_URL}{sep}search={quote_plus(SEARCH)}"
        tag = "search_" + "".join(c if c.isalnum() else "_" for c in SEARCH)[:40]

    print(f"Launching nodriver Chromium (proxy={PROXY or 'none / own IP'})...")
    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH, browser_args=args)

    pages_ok = items_total = floats_total = low_prec = 0
    dp_min, dp_max = 99, 0
    deepest_offset = None
    reason = "completed (hit PAGES limit)"

    try:
        # Open a blank tab first so the network handler is attached BEFORE the page
        # fires its filtered sell-orders request (otherwise we'd miss it).
        page = await browser.get("about:blank")

        async def on_request(evt):
            url = evt.request.url
            if "/market/sell-orders" in url:
                _seen_urls.append(url)

        page.add_handler(cdp.network.RequestWillBeSent, on_request)
        try:
            await page.send(cdp.network.enable())
        except Exception as ex:
            print(f"(network capture unavailable: {ex})")

        print(f"Opening {target_url}")
        await page.get(target_url)
        print(f"Solve any Cloudflare challenge. Waiting {SOLVE_SECONDS}s for the grid...")
        await page.sleep(SOLVE_SECONDS)

        clicked = await dismiss_consent(page)
        print(f"Consent banner: {'dismissed via ' + clicked if clicked else 'left up (does not block fetch)'}")

        # Reliable discovery via the Resource Timing API: the browser records EVERY
        # request the page made, so we read the real sell-orders URL straight out of it
        # (no flaky CDP event timing). Also dump nearby API calls for context.
        # cs.money is an Astro SSR app: the initial filtered listings are rendered
        # server-side (no client XHR to capture). Scroll to provoke lazy-load
        # pagination, which DOES fire a client request carrying the real filter params.
        print("Scrolling to trigger lazy-load pagination...")
        for _ in range(6):
            try:
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            except Exception:
                pass
            await page.sleep(2)

        # nodriver returns arrays unreliably from evaluate(), so JSON.stringify in JS
        # and json.loads here (the string path is proven by fetch_json).
        async def js_list(expr: str) -> list:
            raw = await page.evaluate(f"JSON.stringify({expr})")
            try:
                return json.loads(raw) if isinstance(raw, str) else []
            except (json.JSONDecodeError, TypeError):
                return []

        try:
            all_urls = await js_list("performance.getEntriesByType('resource').map(e=>e.name)")
            print(f">>> Resource Timing saw {len(all_urls)} requests total")
            if all_urls:
                (OUT_DIR / "_all_requests.txt").write_text(
                    "\n".join(dict.fromkeys(all_urls)), encoding="utf-8")
            sell = [u for u in all_urls if "/market/sell-orders" in u]
            _seen_urls.extend(sell)
            api = [u for u in all_urls if "cs.money/" in u and ("/2.0/" in u or "/1.0/" in u)]
            if api:
                (OUT_DIR / "_api_calls.txt").write_text("\n".join(dict.fromkeys(api)), encoding="utf-8")
                print(f">>> {len(set(api))} cs.money API calls; saved to {OUT_DIR / '_api_calls.txt'}")
        except Exception as ex:
            print(f"(resource-timing query failed: {ex})")

        # Dump the SSR'd page so we can see how the filter is encoded and where the
        # listings data lives (Astro embeds island props / hydration JSON in the HTML).
        try:
            html = await page.evaluate("document.documentElement.outerHTML")
            if isinstance(html, str) and html:
                (OUT_DIR / "_page.html").write_text(html, encoding="utf-8")
                print(f">>> saved page HTML ({len(html)} bytes) to {OUT_DIR / '_page.html'}")
        except Exception as ex:
            print(f"(page HTML dump failed: {ex})")

        # Discovery: what sell-orders request did the page actually make?
        if _seen_urls:
            captured = _seen_urls[-1]
            template = template_from(captured)
            print("\n>>> DISCOVERED sell-orders API call the page fired:")
            print(f" {captured}")
            print(f">>> pagination template: {template}\n")
            # Persist it: the console line is easy to lose, and this is the one bit
            # of ground truth (the real filter-param scheme) we need.
            (OUT_DIR / "_discovered.txt").write_text(
                "ALL captured sell-orders requests:\n"
                + "\n".join(dict.fromkeys(_seen_urls))
                + f"\n\npagination template:\n{template}\n",
                encoding="utf-8")
            print(f">>> saved to {OUT_DIR / '_discovered.txt'}")
        else:
            template = DEFAULT_TEMPLATE
            if SEARCH:
                template = template.replace("offset={}", f"search={quote_plus(SEARCH)}&offset={{}}")
            print(f"\n(no request captured; falling back to template: {template})\n")

        for i in range(PAGES):
            offset = START_OFFSET + i * 60
            status, body = await fetch_json(page, template.format(offset))

            if looks_like_challenge(body):
                print(f" page {i + 1} [offset {offset}]: RE-CHALLENGED (status {status}). Stopping.")
                (OUT_DIR / f"{tag}_challenge_offset_{offset}.html").write_text(body, encoding="utf-8")
                reason = f"re-challenged at offset {offset}"
                break

            try:
                items = json.loads(body).get("items", [])
            except json.JSONDecodeError:
                print(f" page {i + 1} [offset {offset}]: non-JSON (status {status}). Stopping.")
                reason = f"non-JSON at offset {offset}"
                break

            if not items:
                print(f" page {i + 1} [offset {offset}]: 0 items — end of results.")
                reason = "end of results"
                break

            (OUT_DIR / f"{tag}_offset_{offset:06d}.json").write_text(body, encoding="utf-8")
            pages_ok += 1
            deepest_offset = offset
            items_total += len(items)
            names = set()
            for it in items:
                fl = it.get("asset", {}).get("float")
                if fl is not None:
                    floats_total += 1
                    d = decimals(fl)
                    dp_min, dp_max = min(dp_min, d), max(dp_max, d)
                    if d <= 6:  # short repr: exact binary fraction (e.g. 1/16), not truncation
                        low_prec += 1
                names.add(it.get("asset", {}).get("names", {}).get("full"))
            sample = next(iter(names), None) if SEARCH else None
            print(f" page {i + 1} [offset {offset}] OK — {len(items)} items"
                  + (f" (e.g. {sample}; {len(names)} distinct names)" if SEARCH else ""))

            await page.sleep(DELAY + random.uniform(0, JITTER))

print("\n=== summary ===")
|
||||
print(f" query: {SEARCH or '(whole market)'}")
|
||||
print(f" stopped: {reason}")
|
||||
print(f" clean pages: {pages_ok} deepest offset: {deepest_offset} items: {items_total}")
|
||||
if floats_total:
|
||||
# Truncation would make MANY values short, not one exact binary fraction.
|
||||
verdict = "FULL precision" if low_prec / floats_total < 0.02 else "POSSIBLE TRUNCATION"
|
||||
print(f" floats: {floats_total} items, {dp_max}-decimal max, "
|
||||
f"{low_prec} short-repr (exact fractions) — {verdict}")
|
||||
print(f" files in {OUT_DIR}")
|
||||
finally:
|
||||
browser.stop()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
uc.loop().run_until_complete(main())
|
||||
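The float-precision verdict in the summary above can be sketched in isolation. This is a minimal sketch, not the script's own code: the `decimals` helper here is a hypothetical reimplementation (the original's body is not shown in this diff), and `precision_verdict`, `short_max` and `tolerance` are illustrative names that mirror the `d <= 6` and `< 0.02` thresholds used above.

```python
from decimal import Decimal


def decimals(value: float) -> int:
    """Digits after the decimal point in the shortest round-trip repr of value."""
    # Decimal(...) expands exponent notation such as '1e-05' into '0.00001'.
    text = format(Decimal(repr(float(value))), "f")
    return len(text.split(".", 1)[1]) if "." in text else 0


def precision_verdict(floats, short_max=6, tolerance=0.02):
    """Flag possible truncation when too many floats have a short repr.

    A handful of short values is expected (exact binary fractions like 1/16);
    systematic truncation would make a large share of them short.
    """
    if not floats:
        return "no floats"
    short = sum(1 for f in floats if decimals(f) <= short_max)
    return "FULL precision" if short / len(floats) < tolerance else "POSSIBLE TRUNCATION"


# One exact binary fraction among 100 long floats stays under the 2% threshold.
mostly_long = [0.0625] + [0.123456789012] * 99
verdict = precision_verdict(mostly_long)
```

The ratio test matters because a single short value says nothing about truncation; only the share of short values across the whole sample does.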
@@ -1,77 +0,0 @@
|
||||
"""
|
||||
Probe which extra filter params cs.money's SSR market search honors, so we can
|
||||
pick a SECOND pagination axis to break apart dense price bands that saturate the
|
||||
60-cap (see diag_windows.py). For a saturating search we try candidate params and
|
||||
report how the returned set's size + float range + price range change.
|
||||
|
||||
python probe_filters.py "Glock-18 Candy Apple mw"
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import sys
|
||||
|
||||
import nodriver as uc
|
||||
|
||||
import worker
|
||||
|
||||
BASE = "https://cs.money/market/buy/?search={q}"
|
||||
# (label, extra query string) — candidates cs.money markets commonly expose.
|
||||
CANDIDATES = [
|
||||
("baseline", ""),
|
||||
("sort=price asc", "&order=asc&sort=price"),
|
||||
("sort=price desc", "&order=desc&sort=price"),
|
||||
("sort=float", "&sort=float"),
|
||||
("minFloat/maxFloat lo", "&minFloat=0.07&maxFloat=0.10"),
|
||||
("minFloat/maxFloat hi", "&minFloat=0.10&maxFloat=0.15"),
|
||||
("minWear/maxWear lo", "&minWear=0.07&maxWear=0.10"),
|
||||
("isStatTrak=true", "&isStatTrak=true"),
|
||||
("hasStickers=false", "&hasStickers=false"),
|
||||
]
|
||||
|
||||
|
||||
def stats(items):
|
||||
floats = [(((it.get("asset") or {}).get("float"))) for it in items]
|
||||
floats = [f for f in floats if isinstance(f, (int, float))]
|
||||
bases = []
|
||||
for it in items:
|
||||
p = it.get("pricing") or {}
|
||||
b = p.get("basePrice", p.get("computed"))
|
||||
if isinstance(b, (int, float)):
|
||||
bases.append(b)
|
||||
fr = f"[{min(floats):.4f},{max(floats):.4f}]" if floats else "[-]"
|
||||
br = f"[{min(bases):.2f},{max(bases):.2f}]" if bases else "[-]"
|
||||
return f"n={len(items):3d} float{fr} base{br}"
|
||||
|
||||
|
||||
async def main():
|
||||
search = " ".join(sys.argv[1:]) or "Glock-18 Candy Apple mw"
|
||||
q = worker.urllib.parse.quote_plus(search)
|
||||
|
||||
args = ["--blink-settings=imagesEnabled=false"]
|
||||
browser = await uc.start(headless=False, browser_args=args)
|
||||
try:
|
||||
page = await browser.get("about:blank")
|
||||
await worker.warm(page)
|
||||
|
||||
base_ids = None
|
||||
for label, extra in CANDIDATES:
|
||||
url = BASE.format(q=q) + extra
|
||||
status, body, _wire = await worker.fetch_json(page, url)  # fetch_json returns (status, body, wire_bytes)
|
||||
if "Just a moment" in body or "challenge-platform" in body:
|
||||
print(f" {label:24s} CHALLENGED")
break
|
||||
items = worker.extract_items(body)
|
||||
ids = {it.get("id") for it in items}
|
||||
if label == "baseline":
|
||||
base_ids = ids
|
||||
delta = ""
|
||||
else:
|
||||
# If a param is IGNORED, the set is identical to baseline.
|
||||
delta = "IGNORED (== baseline)" if ids == base_ids else f"CHANGED ({len(ids ^ (base_ids or set()))} diff ids)"
|
||||
print(f" {label:24s} {stats(items)} {delta}")
|
||||
await page.sleep(worker.DELAY)
|
||||
finally:
|
||||
browser.stop()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
uc.loop().run_until_complete(main())
|
||||
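The IGNORED/CHANGED decision in the probe above reduces to a set comparison on listing ids. The following is a minimal sketch of that comparison with made-up ids; `classify` is an illustrative name, not a function from the deleted script.

```python
def classify(baseline_ids: set, candidate_ids: set) -> str:
    """Compare the id set returned with an extra param against the baseline set."""
    if candidate_ids == baseline_ids:
        # Identical membership: the server most likely ignored the param.
        return "IGNORED (== baseline)"
    # Symmetric difference counts ids present in exactly one of the two sets.
    return f"CHANGED ({len(candidate_ids ^ baseline_ids)} diff ids)"


baseline = {101, 102, 103}
same = classify(baseline, {103, 102, 101})
shifted = classify(baseline, {102, 103, 104})
```

One limitation of comparing sets: a sort param that only reorders the same listings reads as ignored, so a sort candidate is only informative when the search has more results than one page can return.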
@@ -1,5 +1,9 @@
|
||||
# cs.money scraping worker.
|
||||
# Market scraping workers (cs.money, skin.land).
|
||||
# nodriver = the modern successor to undetected-chromedriver: it drives a normal
|
||||
# Chromium over CDP directly (no chromedriver, so none of the cdc_/webdriver tells
|
||||
# that got our .NET Selenium setup insta-challenged by Cloudflare).
|
||||
nodriver>=0.39
|
||||
#
|
||||
# Everything else the workers use is the Python stdlib (asyncio, urllib, logging, json) —
|
||||
# no other third-party deps. Upper bound is a guard against a surprise breaking release;
|
||||
# bump it deliberately after testing a challenge solve.
|
||||
nodriver>=0.39,<0.50
|
||||
|
||||
174
worker/skinland_worker.py
Normal file
@@ -0,0 +1,174 @@
|
||||
"""skin.land scrape worker (pull model).
|
||||
|
||||
A thin strategy over blworker.Worker, mirroring the cs.money worker — it supplies only the
|
||||
skin.land-specific bits; the warm session, poll/scrape/post loop, IPRoyal proxy, IP
|
||||
rotation, logging and shutdown all live in the shared runtime. Env knobs: worker/README.md.
|
||||
|
||||
How skin.land is scraped (learned from the discovery probes):
|
||||
- A job's target is the market PAGE URL, e.g.
|
||||
https://skin.land/market/csgo/ak-47-redline-field-tested/
|
||||
- That Nuxt page embeds an internal numeric skin_id. We resolve it once from the page's
|
||||
__NUXT__ payload (the skin object whose `url` == the page slug), cache it per slug, then
|
||||
page the clean JSON API:
|
||||
GET https://app.skin.land/api/v2/obtained-skins?skin_id={id}&page={n}
|
||||
which returns a Laravel paginator {data:[...offers], meta:{current_page,last_page,…}}.
|
||||
- We walk pages 1..last_page (capped by the job's maxPages), dedup offers by id, and post.
|
||||
|
||||
cd worker
|
||||
.venv\\Scripts\\Activate.ps1
|
||||
pip install -r requirements.txt
|
||||
python skinland_worker.py
|
||||
"""
|
||||
|
||||
import json
|
||||
import re
|
||||
|
||||
from blworker import ScrapeResult, Worker, click, looks_like_challenge, page_fetch, run
|
||||
|
||||
# The offers API. skin_id is skin.land's internal id (resolved from the page); page is the
|
||||
# Laravel paginator page. Same warm session, fetched in-page (CORS-allowed app subdomain).
|
||||
API = "https://app.skin.land/api/v2/obtained-skins?skin_id={skin_id}&page={page}"
|
||||
|
||||
# The page's Nuxt payload is a devalue flat array; the main skin object is the one whose
|
||||
# `url` field resolves to the page slug, and its `id` field resolves to the skin_id.
|
||||
NUXT_ARRAY_RE = re.compile(r'\[\["(?:ShallowReactive|Reactive)",\d+\]')
|
||||
|
||||
|
||||
def slug_of(url: str) -> str:
|
||||
return url.rstrip("/").rsplit("/", 1)[-1]
|
||||
|
||||
|
||||
def extract_nuxt_array(html: str):
|
||||
"""Pull the Nuxt devalue payload (a JSON flat array of values with index references)
|
||||
out of the page HTML. Returns the parsed list, or None."""
|
||||
m = NUXT_ARRAY_RE.search(html)
|
||||
if not m:
|
||||
return None
|
||||
start = m.start()
|
||||
depth = 0
|
||||
instr = False
|
||||
esc = False
|
||||
for i in range(start, len(html)):
|
||||
ch = html[i]
|
||||
if esc:
|
||||
esc = False
|
||||
continue
|
||||
if ch == "\\":
|
||||
esc = True
|
||||
continue
|
||||
if ch == '"':
|
||||
instr = not instr
|
||||
continue
|
||||
if instr:
|
||||
continue
|
||||
if ch == "[":
|
||||
depth += 1
|
||||
elif ch == "]":
|
||||
depth -= 1
|
||||
if depth == 0:
|
||||
try:
|
||||
return json.loads(html[start:i + 1])
|
||||
except json.JSONDecodeError:
|
||||
return None
|
||||
return None
|
||||
|
||||
|
||||
def resolve_skin_id(html: str, slug: str) -> int | None:
|
||||
"""Find the page's main skin object in the Nuxt payload — the dict whose `url` field
|
||||
resolves to the page slug — and return its resolved `id` (skin.land's internal skin_id
|
||||
used by the obtained-skins API)."""
|
||||
arr = extract_nuxt_array(html)
|
||||
if not arr:
|
||||
return None
|
||||
|
||||
def val(ref):
|
||||
return arr[ref] if isinstance(ref, int) and 0 <= ref < len(arr) else ref
|
||||
|
||||
for el in arr:
|
||||
if isinstance(el, dict) and "url" in el and "id" in el and val(el["url"]) == slug:
|
||||
sid = val(el["id"])
|
||||
if isinstance(sid, int):
|
||||
return sid
|
||||
return None
|
||||
|
||||
|
||||
class SkinLandWorker(Worker):
|
||||
name = "skinland"
|
||||
jobs_path = "/skinland/jobs"
|
||||
default_market_url = "https://skin.land/market/csgo/"
|
||||
|
||||
def __init__(self, settings):
|
||||
super().__init__(settings)
|
||||
# skin_id is stable per skin+wear, so cache it per slug to skip the page fetch on
|
||||
# re-sweeps.
|
||||
self._skin_id_cache: dict[str, int] = {}
|
||||
|
||||
def describe_job(self, job) -> str:
|
||||
return slug_of(job["url"])
|
||||
|
||||
async def dismiss_consent(self, page) -> str | None:
|
||||
"""Privacy-preserving: dismiss the cookie banner with essential-only if present."""
|
||||
for label in ("Accept essential", "ACCEPT ESSENTIAL", "Reject all"):
|
||||
if await click(page, label):
|
||||
return f"dismissed via {label!r}"
|
||||
return None
|
||||
|
||||
async def _get_skin_id(self, page, job, slug: str) -> tuple[int | None, str, int]:
|
||||
"""Resolve (and cache) skin.land's skin_id for this slug. Returns
|
||||
(skin_id, reason, wire); reason is "" on success, else a partial-stop reason."""
|
||||
if slug in self._skin_id_cache:
|
||||
return self._skin_id_cache[slug], "", 0
|
||||
|
||||
_status, html, wire = await page_fetch(page, job["url"], accept="text/html")
|
||||
if looks_like_challenge(html):
|
||||
return None, "challenged", max(wire, 0)
|
||||
skin_id = resolve_skin_id(html, slug)
|
||||
if skin_id is None:
|
||||
return None, "no-skin-id", max(wire, 0)
|
||||
self._skin_id_cache[slug] = skin_id
|
||||
return skin_id, "", max(wire, 0)
|
||||
|
||||
async def scrape_job(self, page, job) -> ScrapeResult:
|
||||
"""Scrape ALL offers for one skin+wear by paging the obtained-skins API."""
|
||||
slug = slug_of(job["url"])
|
||||
max_pages = job.get("maxPages", 40)
|
||||
|
||||
skin_id, reason, wire = await self._get_skin_id(page, job, slug)
|
||||
if skin_id is None:
|
||||
return ScrapeResult([], 0, reason, wire)
|
||||
|
||||
seen: dict = {}
|
||||
fetches = 0
|
||||
page_n = 1
|
||||
reason = "completed"
|
||||
while page_n <= max_pages:
|
||||
_status, body, wbytes = await page_fetch(page, API.format(skin_id=skin_id, page=page_n))
|
||||
fetches += 1
|
||||
if wbytes > 0:
|
||||
wire += wbytes
|
||||
|
||||
if looks_like_challenge(body):
|
||||
return ScrapeResult(list(seen.values()), fetches, "challenged", wire)
|
||||
try:
|
||||
payload = json.loads(body)
|
||||
except json.JSONDecodeError:
|
||||
return ScrapeResult(list(seen.values()), fetches, "bad-json", wire)
|
||||
|
||||
for o in payload.get("data") or []:
|
||||
if o.get("id") is not None:
|
||||
seen[o["id"]] = o
|
||||
|
||||
meta = payload.get("meta") or {}
|
||||
last = meta.get("last_page")
|
||||
if not payload.get("data") or (isinstance(last, int) and page_n >= last):
|
||||
break # walked the final page
|
||||
page_n += 1
|
||||
await self._pace(page)
|
||||
else:
|
||||
reason = "fetch-cap"
|
||||
|
||||
return ScrapeResult(list(seen.values()), fetches, reason, wire)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
run(SkinLandWorker)
|
||||
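The two steps of the skin.land worker above (resolve the skin_id through index references, then walk the paginator) can be tried without a browser. This is a minimal sketch under stated assumptions: the payload, the id `4711` and `fake_api` are made up for illustration, and `find_skin_id` / `walk_pages` are simplified stand-ins for `resolve_skin_id` and `scrape_job`, with no network, pacing or challenge handling.

```python
import json


def find_skin_id(arr, slug):
    """Return the id of the dict whose `url` reference resolves to slug, else None."""
    def val(ref):
        # In the flat array, dict values are indices into the same array.
        return arr[ref] if isinstance(ref, int) and 0 <= ref < len(arr) else ref

    for el in arr:
        if isinstance(el, dict) and "url" in el and "id" in el and val(el["url"]) == slug:
            sid = val(el["id"])
            if isinstance(sid, int):
                return sid
    return None


def walk_pages(fetch_page, max_pages=40):
    """Walk pages 1..last_page, dedup offers by id, and report why the walk stopped."""
    seen = {}
    page_n = 1
    while page_n <= max_pages:
        payload = fetch_page(page_n)
        data = payload.get("data") or []
        for offer in data:
            if offer.get("id") is not None:
                seen[offer["id"]] = offer
        last = (payload.get("meta") or {}).get("last_page")
        if not data or (isinstance(last, int) and page_n >= last):
            return list(seen.values()), "completed"
        page_n += 1
    return list(seen.values()), "fetch-cap"


# Toy flat array: element 1 is the skin object; its fields point at elements 2 and 3.
toy = json.loads('[["ShallowReactive", 1], {"id": 2, "url": 3}, 4711, "ak-47-redline-field-tested"]')
skin_id = find_skin_id(toy, "ak-47-redline-field-tested")


def fake_api(page_n):
    # Stand-in for the paginated API; offer id 2 appears on two pages on purpose.
    pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 2}, {"id": 3}], 3: [{"id": 4}]}
    return {"data": pages.get(page_n, []), "meta": {"current_page": page_n, "last_page": 3}}


offers, reason = walk_pages(fake_api)
capped, capped_reason = walk_pages(fake_api, max_pages=2)
```

Keying `seen` by offer id is what keeps the repeated offer from being counted twice when the listing shifts between page fetches.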
@@ -1,77 +0,0 @@
|
||||
"""
|
||||
One-off count verification: scrape a single skin+wear search from cs.money and
|
||||
report how many distinct sell-orders come back, reusing the production worker's
|
||||
warm-session + price-window bisection logic (worker.scrape_job).
|
||||
|
||||
Use it to sanity-check that our pagination actually recovers the FULL listing
|
||||
count cs.money shows on the site (the known ground truth) for one query.
|
||||
|
||||
cd worker
|
||||
.venv\\Scripts\\Activate.ps1
|
||||
python verify_count.py "Desert Eagle Bronze Deco fn"
|
||||
|
||||
Env knobs (same meaning as worker.py): SOLVE_SECONDS, DELAY, JITTER, PROXY,
|
||||
BROWSER_PATH, LOAD_IMAGES. MAX_FETCHES caps window fetches (default 80).
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
import sys
|
||||
from collections import Counter
|
||||
|
||||
import nodriver as uc
|
||||
|
||||
import worker
|
||||
|
||||
MAX_FETCHES = int(os.environ.get("MAX_FETCHES", "80"))
|
||||
|
||||
|
||||
async def main():
|
||||
search = " ".join(sys.argv[1:]) or "Desert Eagle Bronze Deco fn"
|
||||
|
||||
args = [f"--proxy-server={worker.PROXY}"] if worker.PROXY else []
|
||||
if not worker.LOAD_IMAGES:
|
||||
args.append("--blink-settings=imagesEnabled=false")
|
||||
if os.environ.get("CHROME_NO_SANDBOX") == "1":
|
||||
args += ["--no-sandbox", "--disable-dev-shm-usage"]
|
||||
|
||||
print(f"Verifying count for search {search!r} (proxy={worker.PROXY or 'own IP'})")
|
||||
browser = await uc.start(
|
||||
headless=False, browser_executable_path=worker.BROWSER_PATH, browser_args=args)
|
||||
try:
|
||||
page = await browser.get("about:blank")
|
||||
await worker.warm(page)
|
||||
|
||||
job = {"search": search, "maxPages": MAX_FETCHES}
|
||||
items, fetches, reason = await worker.scrape_job(page, job)
|
||||
|
||||
print("\n=== result ===")
|
||||
print(f" search: {search}")
|
||||
print(f" stopped: {reason}")
|
||||
print(f" fetches: {fetches}")
|
||||
print(f" DISTINCT sell-orders (deduped by id): {len(items)}")
|
||||
|
||||
# Break down what came back so we can see whether the count is inflated by
|
||||
# off-target names/wears (the C2's name+wear filter would drop those later).
|
||||
names = Counter()
|
||||
wears = Counter()
|
||||
st = 0
|
||||
for it in items:
|
||||
asset = it.get("asset") or {}
|
||||
names[(asset.get("names") or {}).get("full")] += 1
|
||||
wears[asset.get("quality")] += 1
|
||||
if asset.get("isStatTrak"):
|
||||
st += 1
|
||||
print(f" StatTrak in set: {st}")
|
||||
print(" by name:")
|
||||
for name, n in names.most_common():
|
||||
print(f" {n:4d} {name}")
|
||||
print(" by wear (quality code):")
|
||||
for w, n in wears.most_common():
|
||||
print(f" {n:4d} {w}")
|
||||
finally:
|
||||
browser.stop()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
uc.loop().run_until_complete(main())
|
||||
@@ -1,79 +0,0 @@
|
||||
"""
|
||||
Validate the float-cursor scrape by walking the float axis in BOTH directions and
|
||||
comparing the recovered sell-order id sets. If ascending (lowest float first) and
|
||||
descending (highest float first) independently land on the same listings, the
|
||||
cursor is exhaustive and order-independent — i.e. the count is real, not an artifact
|
||||
of walk direction or boundary double-counting.
|
||||
|
||||
python verify_crosscheck.py "Glock-18 Candy Apple mw"
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import sys
|
||||
|
||||
import nodriver as uc
|
||||
|
||||
import worker
|
||||
|
||||
CAP = worker.PAGE_CAP
|
||||
ASC = ("https://cs.money/market/buy/?search={q}"
|
||||
"&order=asc&sort=float&minFloat={cur:.12f}&maxFloat=1")
|
||||
DESC = ("https://cs.money/market/buy/?search={q}"
|
||||
"&order=desc&sort=float&minFloat=0&maxFloat={cur:.12f}")
|
||||
|
||||
|
||||
async def walk(page, q, template, ascending, max_fetches=60):
|
||||
seen = {}
|
||||
cur = 0.0 if ascending else 1.0
|
||||
fetches = 0
|
||||
while fetches < max_fetches:
|
||||
status, body, _wire = await worker.fetch_json(page, template.format(q=q, cur=cur))  # fetch_json returns (status, body, wire_bytes)
|
||||
fetches += 1
|
||||
if "Just a moment" in body or "challenge-platform" in body:
|
||||
return seen, fetches, "challenged"
|
||||
items = worker.extract_items(body)
|
||||
floats = []
|
||||
for it in items:
|
||||
if it.get("id") is not None:
|
||||
seen[it["id"]] = it
|
||||
fl = (it.get("asset") or {}).get("float")
|
||||
if isinstance(fl, (int, float)):
|
||||
floats.append(fl)
|
||||
if len(items) < CAP:
|
||||
return seen, fetches, "completed"
|
||||
nxt = (max(floats) if ascending else min(floats)) if floats else None
|
||||
if nxt is None or (ascending and nxt <= cur) or (not ascending and nxt >= cur):
|
||||
return seen, fetches, "stuck"
|
||||
cur = nxt
|
||||
await page.sleep(worker.DELAY)
|
||||
return seen, fetches, "fetch-cap"
|
||||
|
||||
|
||||
async def main():
|
||||
search = " ".join(sys.argv[1:]) or "Glock-18 Candy Apple mw"
|
||||
q = worker.urllib.parse.quote_plus(search)
|
||||
browser = await uc.start(headless=False, browser_args=["--blink-settings=imagesEnabled=false"])
|
||||
try:
|
||||
page = await browser.get("about:blank")
|
||||
await worker.warm(page)
|
||||
|
||||
asc, fa, ra = await walk(page, q, ASC, ascending=True)
|
||||
print(f"ASC : {len(asc):4d} ids {fa} fetches {ra}")
|
||||
desc, fd, rd = await walk(page, q, DESC, ascending=False)
|
||||
print(f"DESC: {len(desc):4d} ids {fd} fetches {rd}")
|
||||
|
||||
a, d = set(asc), set(desc)
|
||||
union = a | d
|
||||
print("\n=== cross-check ===")
|
||||
print(f" ASC only: {len(a - d)}")
|
||||
print(f" DESC only: {len(d - a)}")
|
||||
print(f" in both: {len(a & d)}")
|
||||
print(f" UNION (distinct):{len(union)}")
|
||||
agree = "AGREE — count is solid" if a == d else "DISAGREE — one walk missed listings"
|
||||
print(f" verdict: {agree}")
|
||||
finally:
|
||||
browser.stop()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
uc.loop().run_until_complete(main())
|
||||
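The cross-check above can be exercised offline against an in-memory listing. This is a minimal sketch, assuming a server that sorts by float, applies inclusive `minFloat`/`maxFloat` bounds and caps each response; `fake_fetch`, the sample listings and the cap of 3 are illustrative and not cs.money's actual behavior.

```python
def fake_fetch(listings, lo, hi, ascending, cap):
    """Stand-in for one capped page: filter by inclusive float bounds, sort, truncate."""
    hits = [x for x in listings if lo <= x["float"] <= hi]
    hits.sort(key=lambda x: x["float"], reverse=not ascending)
    return hits[:cap]


def walk(listings, ascending, cap=3, max_fetches=60):
    """Advance a float cursor one page at a time; return (seen_by_id, stop_reason)."""
    seen = {}
    cur = 0.0 if ascending else 1.0
    for _ in range(max_fetches):
        lo, hi = (cur, 1.0) if ascending else (0.0, cur)
        items = fake_fetch(listings, lo, hi, ascending, cap)
        for it in items:
            seen[it["id"]] = it
        if len(items) < cap:
            return seen, "completed"  # a short page means nothing is left
        floats = [it["float"] for it in items]
        nxt = max(floats) if ascending else min(floats)
        if (ascending and nxt <= cur) or (not ascending and nxt >= cur):
            return seen, "stuck"  # a full page of ties cannot move the cursor
        cur = nxt
    return seen, "fetch-cap"


listings = [{"id": i, "float": i / 100} for i in range(1, 9)]
asc_ids, asc_reason = walk(listings, ascending=True)
desc_ids, desc_reason = walk(listings, ascending=False)
agree = set(asc_ids) == set(desc_ids)

# Four listings sharing one float exceed the cap, so the cursor stalls.
ties = [{"id": i, "float": 0.5} for i in range(4)]
tie_ids, tie_reason = walk(ties, ascending=True)
```

Because the bounds are inclusive, each page re-fetches the boundary item from the previous page; deduplicating by id absorbs that overlap, which is why the two walks can be compared as sets.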
483
worker/worker.py
@@ -1,483 +0,0 @@
|
||||
"""
|
||||
cs.money scrape worker (pull model).
|
||||
|
||||
Holds ONE warm nodriver session (the thing that beats Cloudflare), then loops:
|
||||
poll the .NET C2 for a job, scrape that skin+wear's sell-orders via in-page fetch
|
||||
from the cleared session, and post the results back. The C2 owns job selection
|
||||
(stalest skin+wear first) and persistence; this worker just fetches and forwards.
|
||||
|
||||
cd worker
|
||||
.venv\\Scripts\\Activate.ps1
|
||||
pip install -r requirements.txt
|
||||
python worker.py
|
||||
|
||||
Env knobs:
|
||||
C2_URL C2 base URL (default http://localhost:5080)
|
||||
WORKER_TOKEN shared secret, must match the C2's WorkerToken (default dev-worker-token)
|
||||
MARKET_URL market page to warm the session on (default the buy market)
|
||||
SOLVE_SECONDS seconds to clear Cloudflare on startup (default 30)
|
||||
DELAY / JITTER base + random seconds between page fetches (default 2.0 / 1.5)
|
||||
IDLE_SECONDS sleep when the C2 has no work (default 10)
|
||||
BROWSER_PATH path to Chrome/Edge if auto-detect fails
|
||||
|
||||
Proxy (pick one; IPRoyal takes priority when its creds are set):
|
||||
IPROYAL_USERNAME IPRoyal residential account username
|
||||
IPROYAL_PASSWORD IPRoyal residential account password
|
||||
IPROYAL_COUNTRY ISO country for the exit (default us; blank = any)
|
||||
IPROYAL_LIFETIME_MIN sticky-IP hold in minutes (default 60)
|
||||
PROXY host:port for an auth-free proxy (fallback; omit to use your own IP)
|
||||
|
||||
Each worker process mints its own random IPRoyal sticky session at startup, so N
|
||||
workers get N distinct residential exit IPs with no coordination — scale with
|
||||
`docker compose up --scale worker=N`. On a Cloudflare challenge the worker rotates
|
||||
to a fresh session (new IP) and re-warms. Chromium can't carry proxy credentials on
|
||||
--proxy-server, so we run a tiny in-process forwarder (LocalForwardingProxy below)
|
||||
that injects the IPRoyal auth and chains to the gateway; Chrome talks only to an
|
||||
auth-free 127.0.0.1 endpoint, keeping us at zero CDP (a CDP auth handler is a
|
||||
Cloudflare tell).
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import base64
|
||||
import json
|
||||
import os
|
||||
import random
|
||||
import re
|
||||
import urllib.error
|
||||
import urllib.parse
|
||||
import urllib.request
|
||||
import uuid
|
||||
|
||||
import nodriver as uc
|
||||
|
||||
C2_URL = os.environ.get("C2_URL", "http://localhost:5080").rstrip("/")
|
||||
TOKEN = os.environ.get("WORKER_TOKEN", "dev-worker-token")
|
||||
MARKET_URL = os.environ.get("MARKET_URL", "https://cs.money/market/buy/")
|
||||
SOLVE_SECONDS = int(os.environ.get("SOLVE_SECONDS", "30"))
|
||||
DELAY = float(os.environ.get("DELAY", "2.0"))
|
||||
JITTER = float(os.environ.get("JITTER", "1.5"))
|
||||
IDLE_SECONDS = int(os.environ.get("IDLE_SECONDS", "10"))
|
||||
PROXY = os.environ.get("PROXY")
|
||||
BROWSER_PATH = os.environ.get("BROWSER_PATH")
|
||||
|
||||
# IPRoyal residential gateway. One fixed host/port; country, sticky-session id and
|
||||
# lifetime are encoded as underscore params appended to the password (see
|
||||
# _iproyal_password). Mirrors the .NET IpRoyalProxyProvider scheme.
|
||||
IPROYAL_HOST = os.environ.get("IPROYAL_HOST", "geo.iproyal.com")
|
||||
IPROYAL_PORT = int(os.environ.get("IPROYAL_PORT", "12321"))
|
||||
IPROYAL_USERNAME = os.environ.get("IPROYAL_USERNAME")
|
||||
IPROYAL_PASSWORD = os.environ.get("IPROYAL_PASSWORD")
|
||||
IPROYAL_COUNTRY = os.environ.get("IPROYAL_COUNTRY", "us").strip().lower()
|
||||
IPROYAL_LIFETIME_MIN = int(os.environ.get("IPROYAL_LIFETIME_MIN", "60"))
|
||||
# Residential proxy is metered per GB. Cloudflare gates on JS, not images, and the
|
||||
# sell-orders API is pure JSON — so block images by default to slash page-render
|
||||
# bandwidth. Set LOAD_IMAGES=1 to re-enable (e.g. for debugging the visible page).
|
||||
LOAD_IMAGES = os.environ.get("LOAD_IMAGES") == "1"
|
||||
|
||||
# cs.money is an Astro SSR app: the free-text market search filters server-side and
|
||||
# the resulting listings are embedded in the page as a __page-params JSON blob. The
|
||||
# /2.0/market/sell-orders API rejects a `search` param (HTTP 400), so we fetch the
|
||||
# PAGE for a search and read the embedded items — same item shape as the API.
|
||||
#
|
||||
# A page returns at most 60 and offset is ignored, so we paginate with a FORWARD
|
||||
# CURSOR on float: cs.money honors `order=asc&sort=float` + `minFloat`, and float is
|
||||
# full-precision and effectively unique per item. We grab the 60 lowest-float items
|
||||
# at/above `lo`, advance `lo` to the highest float returned, and repeat until a page
|
||||
# is under the cap. (The old minPrice/maxPrice bisection silently truncated cheap
|
||||
# skins: >60 listings can share a sub-$0.02 reference band, which no price window can
|
||||
# split — floats almost never tie, so the cursor always makes progress.)
|
||||
PAGE = ("https://cs.money/market/buy/?search={search}"
|
||||
"&order=asc&sort=float&minFloat={lo:.12f}&maxFloat=1")
|
||||
PAGE_CAP = 60 # items per SSR page
|
||||
PAGE_PARAMS_RE = re.compile(
|
||||
r'<script\b[^>]*id="__page-params"[^>]*>(.*?)</script>', re.S)
|
||||
|
||||
|
||||
# --- IPRoyal residential proxy ----------------------------------------------------
|
||||
|
||||
def _new_session_id() -> str:
|
||||
"""Short, opaque, URL-safe token. IPRoyal pins one residential exit IP per
|
||||
distinct session value, so a fresh id == a fresh IP."""
|
||||
return uuid.uuid4().hex[:10]
|
||||
|
||||
|
||||
def _iproyal_password(session_id: str) -> str:
|
||||
"""Bake the targeting/session knobs onto the account password, IPRoyal-style:
|
||||
"<pass>_country-us_session-<id>_lifetime-60m". Country is optional."""
|
||||
pw = IPROYAL_PASSWORD
|
||||
if IPROYAL_COUNTRY:
|
||||
pw += f"_country-{IPROYAL_COUNTRY}"
|
||||
pw += f"_session-{session_id}_lifetime-{IPROYAL_LIFETIME_MIN}m"
|
||||
return pw
|
||||
|
||||
|
||||
class LocalForwardingProxy:
|
||||
"""In-process HTTP proxy on 127.0.0.1 that chains every connection to the IPRoyal
|
||||
gateway, injecting the Proxy-Authorization header itself. Chromium ignores creds in
|
||||
--proxy-server and the in-browser ways to answer the gateway's 407 (a CDP auth
|
||||
handler, or a disabled MV2 extension) are Cloudflare tells — so we terminate the
|
||||
browser->proxy hop locally and add auth here, leaving Chrome to talk to an auth-free
|
||||
endpoint at zero CDP. HTTPS (all cs.money serves) flows through the CONNECT tunnel,
|
||||
so this proxy only relays ciphertext and never sees plaintext. Ported from the .NET
|
||||
LocalForwardingProxy. The active session token can be swapped live (set_password) to
|
||||
move to a fresh exit IP without restarting the browser. (New tunnels pick up the new
|
||||
IP; any still-open keep-alive tunnel stays on the old one until it closes.)"""
|
||||
|
||||
def __init__(self, host: str, port: int, username: str, password: str):
|
||||
self._host = host
|
||||
self._port = port
|
||||
self._username = username
|
||||
self._password = password
|
||||
self._server: asyncio.AbstractServer | None = None
|
||||
self.endpoint = ""
|
||||
|
||||
def set_password(self, password: str) -> None:
|
||||
self._password = password
|
||||
|
||||
def _auth_header(self) -> str:
|
||||
token = base64.b64encode(f"{self._username}:{self._password}".encode()).decode()
|
||||
return f"Proxy-Authorization: Basic {token}\r\n"
|
||||
|
||||
async def start(self) -> "LocalForwardingProxy":
|
||||
self._server = await asyncio.start_server(self._handle, "127.0.0.1", 0)
|
||||
port = self._server.sockets[0].getsockname()[1]
|
||||
self.endpoint = f"127.0.0.1:{port}"
|
||||
return self
|
||||
|
||||
async def stop(self) -> None:
|
||||
if self._server is not None:
|
||||
self._server.close()
|
||||
try:
|
||||
await self._server.wait_closed()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
@staticmethod
|
||||
async def _read_header(reader: asyncio.StreamReader) -> str | None:
|
||||
"""Read up to the end of the HTTP header block (CRLFCRLF). None on EOF/overflow."""
|
||||
try:
|
||||
data = await reader.readuntil(b"\r\n\r\n")
|
||||
except (asyncio.IncompleteReadError, asyncio.LimitOverrunError):
|
||||
return None
|
||||
return data.decode("latin-1")
|
||||
|
||||
async def _handle(self, client_reader: asyncio.StreamReader, client_writer: asyncio.StreamWriter) -> None:
|
||||
up_writer: asyncio.StreamWriter | None = None
|
||||
try:
|
||||
header = await self._read_header(client_reader)
|
||||
if not header:
|
||||
return
|
||||
parts = header.split("\r\n", 1)[0].split(" ")
|
||||
if len(parts) < 2:
|
||||
return
|
||||
method, target = parts[0], parts[1]
|
||||
|
||||
up_reader, up_writer = await asyncio.open_connection(self._host, self._port)
|
||||
if method.upper() == "CONNECT":
|
||||
# HTTPS: open an authenticated tunnel upstream, then relay raw bytes.
|
||||
up_writer.write(
|
||||
f"CONNECT {target} HTTP/1.1\r\nHost: {target}\r\n{self._auth_header()}\r\n".encode())
|
||||
await up_writer.drain()
|
||||
up_header = await self._read_header(up_reader)
|
||||
status = up_header.split(" ", 2) if up_header else []
|
||||
if len(status) < 2 or status[1] != "200":
|
||||
line = (up_header or "no response").split("\r\n", 1)[0]
|
||||
print(f" proxy: upstream refused CONNECT {target}: {line}")
|
||||
client_writer.write(b"HTTP/1.1 502 Bad Gateway\r\nConnection: close\r\n\r\n")
|
||||
await client_writer.drain()
|
||||
return
|
||||
client_writer.write(b"HTTP/1.1 200 Connection established\r\n\r\n")
|
||||
await client_writer.drain()
|
||||
else:
|
||||
# Plain HTTP: re-inject the request upstream with auth, then relay.
|
||||
idx = header.index("\r\n") + 2
|
||||
up_writer.write((header[:idx] + self._auth_header() + header[idx:]).encode())
|
||||
await up_writer.drain()
|
||||
|
||||
await self._relay(client_reader, client_writer, up_reader, up_writer)
|
||||
except Exception:
|
||||
pass # one bad tunnel must never take down the listener
|
||||
finally:
|
||||
for w in (client_writer, up_writer):
|
||||
if w is not None:
|
||||
try:
|
||||
w.close()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
@staticmethod
|
||||
async def _relay(
|
||||
client_reader: asyncio.StreamReader, client_writer: asyncio.StreamWriter,
|
||||
up_reader: asyncio.StreamReader, up_writer: asyncio.StreamWriter) -> None:
|
||||
# Pipe both directions, but tear the whole tunnel down as soon as EITHER side
|
||||
# closes (mirrors the .NET WhenAny). Waiting for both — as a plain gather does —
|
||||
# leaks a task holding two sockets on every half-closed connection, which piles
|
||||
# up fast across a long multi-worker run. Closing both writers when the first
|
||||
# pipe finishes unblocks the other's pending read so both tasks settle.
|
||||
async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
|
||||
try:
|
||||
while data := await reader.read(65536):
|
||||
writer.write(data)
|
||||
await writer.drain()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
a = asyncio.create_task(pipe(client_reader, up_writer))
|
||||
b = asyncio.create_task(pipe(up_reader, client_writer))
|
||||
try:
|
||||
await asyncio.wait({a, b}, return_when=asyncio.FIRST_COMPLETED)
|
||||
finally:
|
||||
for w in (client_writer, up_writer):
|
||||
try:
|
||||
w.close()
|
||||
except Exception:
|
||||
pass
|
||||
await asyncio.gather(a, b, return_exceptions=True)
|
||||
|
||||
|
||||
def looks_like_challenge(body: str) -> bool:
|
||||
s = (body or "").lstrip()
|
||||
return not s or s.startswith("<") or "Just a moment" in body or "challenge-platform" in body
|
||||
|
||||
|
||||
# --- C2 HTTP (stdlib, run off the event loop) -------------------------------------
|
||||
|
||||
def _get_job_sync():
|
||||
req = urllib.request.Request(f"{C2_URL}/jobs/next", headers={"X-Worker-Token": TOKEN})
|
||||
try:
|
||||
with urllib.request.urlopen(req, timeout=15) as r:
|
||||
if r.status == 204:
|
||||
return None
|
||||
return json.loads(r.read() or b"null")
|
||||
except urllib.error.HTTPError as e:
|
||||
print(f" C2 /jobs/next -> HTTP {e.code}")
|
||||
return None
|
||||
except urllib.error.URLError as e:
|
||||
print(f" C2 unreachable: {e}")
|
||||
return None
|
||||
|
||||
|
||||
def _post_result_sync(job_id: str, payload: dict):
|
||||
data = json.dumps(payload).encode()
|
||||
req = urllib.request.Request(
|
||||
f"{C2_URL}/jobs/{job_id}/result", data=data, method="POST",
|
||||
headers={"X-Worker-Token": TOKEN, "Content-Type": "application/json"})
|
||||
try:
|
||||
with urllib.request.urlopen(req, timeout=60) as r:
|
||||
return json.loads(r.read() or b"null")
|
||||
except urllib.error.HTTPError as e:
|
||||
print(f" C2 result -> HTTP {e.code}: {e.read()[:200]!r}")
|
||||
return None
|
||||
except urllib.error.URLError as e:
|
||||
print(f" C2 unreachable posting result: {e}")
|
||||
return None
|
||||
|
||||
|
||||
async def get_job():
|
||||
return await asyncio.to_thread(_get_job_sync)
|
||||
|
||||
|
||||
async def post_result(job_id, payload):
|
||||
return await asyncio.to_thread(_post_result_sync, job_id, payload)
|
||||
|
||||
|
||||
# --- scraping ---------------------------------------------------------------------

async def fetch_json(page, url: str) -> tuple[str, str, int]:
    """Fetch in-page and read back the Resource Timing transferSize: the actual
    COMPRESSED bytes on the wire (what the metered proxy bills), not len(body), which is
    the decompressed size. Returns (status, body, wire_bytes); wire_bytes is -1 if the
    timing entry wasn't available. Same-origin (cs.money), so the size fields are exposed."""
    expr = (
        f"fetch({url!r}, {{credentials:'include', headers:{{'accept':'application/json'}}}})"
        f".then(async r => {{"
        f"  const body = await r.text();"
        f"  const e = performance.getEntriesByName({url!r}).slice(-1)[0];"
        f"  return JSON.stringify({{status: r.status, body: body,"
        f"    wire: e ? e.transferSize : -1, dec: e ? e.decodedBodySize : -1}});"
        f"}})"
    )
    raw = await page.evaluate(expr, await_promise=True)
    if not isinstance(raw, str):
        return ("-1", "", -1)
    try:
        obj = json.loads(raw)
        return (str(obj.get("status", "-1")), obj.get("body", ""), int(obj.get("wire", -1)))
    except (json.JSONDecodeError, ValueError, TypeError):
        return ("-1", raw, -1)


async def _click(page, text, timeout=3):
    """Best-effort click on the element matching `text`. Returns True only if a click happened."""
    try:
        el = await page.find(text, best_match=True, timeout=timeout)
        if el:
            await el.click()
            return True
    except Exception:
        pass
    return False


async def dismiss_consent(page):
    """Privacy-preserving. The banner only offers 'Accept all' / 'Manage cookies';
    the Reject-all control lives inside the Manage window. So: Manage -> Reject all ->
    Confirm. (The data path reads SSR __page-params regardless, but this keeps the
    session honest and unblocks any future interaction.)"""
    steps = []
    if await _click(page, "Manage cookies") or await _click(page, "Manage"):
        await page.sleep(1)
        if await _click(page, "Reject all"):
            steps.append("reject-all")
        for c in ("Confirm my choice", "Confirm", "Save"):
            if await _click(page, c):
                steps.append(f"confirm:{c}")
                break
    return ", ".join(steps) if steps else None


async def warm(page):
    """Open the market and clear Cloudflare so the session holds cf_clearance."""
    print(f"Warming session at {MARKET_URL} (clear Cloudflare; {SOLVE_SECONDS}s)...")
    await page.get(MARKET_URL)
    await page.sleep(SOLVE_SECONDS)
    clicked = await dismiss_consent(page)
    print(f"Consent: {'dismissed via ' + clicked if clicked else 'left up'}")


def extract_items(html: str) -> list:
    """Pull inventory.items out of the page's __page-params JSON blob."""
    m = PAGE_PARAMS_RE.search(html)
    if not m:
        return []
    try:
        return json.loads(m.group(1)).get("inventory", {}).get("items", []) or []
    except json.JSONDecodeError:
        return []


async def scrape_job(page, job) -> tuple[list, int, str, int]:
    """Scrape ALL listings for one skin+wear via a forward float cursor.

    A search page returns at most 60 items and ignores offset, but cs.money sorts by
    float (order=asc&sort=float) and filters by minFloat. So we walk the float axis:
    grab the 60 lowest-float items at/above `lo`, advance `lo` to the highest float on
    the page, and repeat until a page is under the cap. The boundary item is re-fetched
    (minFloat is inclusive) and dropped by the id dedup. Returns
    (items, fetches, reason, wire_bytes) where wire_bytes is the metered (compressed) cost.
    """
    search = urllib.parse.quote_plus(job["search"])
    max_fetches = job.get("maxPages", 40)  # safety cap on page fetches per job
    seen: dict = {}
    fetches = 0
    wire = 0
    lo = 0.0
    reason = "completed"

    while fetches < max_fetches:
        status, body, wbytes = await fetch_json(page, PAGE.format(search=search, lo=lo))
        fetches += 1
        if wbytes > 0:
            wire += wbytes

        if "Just a moment" in body or "challenge-platform" in body:
            return list(seen.values()), fetches, "challenged", wire

        items = extract_items(body)
        floats = []
        for it in items:
            if it.get("id") is not None:
                seen[it["id"]] = it
            fl = (it.get("asset") or {}).get("float")
            if isinstance(fl, (int, float)):
                floats.append(fl)

        if len(items) < PAGE_CAP:
            break  # last page: fewer than the cap means we've seen everything

        # Advance the cursor to the highest float on this page. Items at exactly that
        # float are re-fetched next round (minFloat is inclusive) and deduped by id.
        nxt = max(floats) if floats else None
        if nxt is None or nxt <= lo:
            # Cursor can't advance: >60 listings share a single float value, or the
            # items carry no float. Bail loudly rather than spin; a flagged gap beats
            # a silent one (this is the failure the price-window version hid).
            reason = "stuck-float-tie"
            break
        lo = nxt

        await page.sleep(DELAY + random.uniform(0, JITTER))
    else:
        # while-else: reached only when the loop ran out of fetches without a break.
        reason = "fetch-cap"

    return list(seen.values()), fetches, reason, wire


async def main():
    # IPRoyal (auth'd, per-worker sticky IP) takes priority; else a plain auth-free
    # PROXY; else this host's own IP. The forwarder injects IPRoyal auth so Chrome
    # only ever sees an auth-free 127.0.0.1 endpoint.
    forwarder = None
    session_id = None
    if IPROYAL_USERNAME and IPROYAL_PASSWORD:
        session_id = _new_session_id()
        forwarder = await LocalForwardingProxy(
            IPROYAL_HOST, IPROYAL_PORT, IPROYAL_USERNAME, _iproyal_password(session_id)).start()
        proxy = forwarder.endpoint
        proxy_label = f"iproyal[{IPROYAL_COUNTRY or 'any'}] session {session_id} via {forwarder.endpoint}"
    else:
        proxy = PROXY
        proxy_label = PROXY or "own IP"

    args = [f"--proxy-server={proxy}"] if proxy else []
    if not LOAD_IMAGES:
        # Disable image loading at the engine level; images are the dominant bandwidth
        # cost on an image-heavy market, and unneeded for CF clearance or the JSON API.
        args.append("--blink-settings=imagesEnabled=false")
    if os.environ.get("CHROME_NO_SANDBOX") == "1":
        # Required when running Chromium as root in a container.
        args += ["--no-sandbox", "--disable-dev-shm-usage"]
    print(f"Starting worker (C2={C2_URL}, proxy={proxy_label}, images={'on' if LOAD_IMAGES else 'off'})...")
    browser = await uc.start(headless=False, browser_executable_path=BROWSER_PATH, browser_args=args)
    try:
        page = await browser.get("about:blank")
        await warm(page)

        total_wire = 0  # metered (compressed) bytes this worker has pulled, lifetime
        while True:
            job = await get_job()
            if not job:
                await asyncio.sleep(IDLE_SECONDS)
                continue

            print(f"Job {job['jobId'][:8]} \u2014 search {job['search']!r}")
            items, pages, reason, wire = await scrape_job(page, job)
            total_wire += wire

            if reason == "challenged":
                # The exit IP is likely flagged. On IPRoyal, rotate to a fresh sticky
                # session (new IP) before re-warming; otherwise just re-solve in place.
                if forwarder is not None:
                    session_id = _new_session_id()
                    forwarder.set_password(_iproyal_password(session_id))
                    print(f" challenged; rotating exit IP -> session {session_id}, re-warming...")
                else:
                    print(" re-challenged; re-warming session...")
                await warm(page)

            result = await post_result(job["jobId"], {
                "items": items, "pages": pages, "stoppedReason": reason})
            summary = (f"matched {result.get('matched')}, new {result.get('inserted')}, "
                       f"upd {result.get('updated')}, removed {result.get('removed')}") if result else "post failed"
            wire_kb = wire / 1024
            print(f" scraped {len(items)} items ({pages}p, {reason}, {wire_kb:.0f}KB wire) "
                  f"-> {summary} [lifetime {total_wire / 1_048_576:.1f}MB]")

            await page.sleep(DELAY + random.uniform(0, JITTER))
    finally:
        browser.stop()
        if forwarder is not None:
            await forwarder.stop()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())
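
Note on the float cursor: the walk described in the `scrape_job` docstring can be exercised offline, without a browser or proxy. The following is a minimal sketch, assuming a hypothetical in-memory `fake_search` that mimics what the docstring says about the search page (sorted by float ascending, inclusive minFloat filter, at most 60 items). `fake_search`, `walk_float_cursor`, and the synthetic listings are illustrative names and data, not part of the worker.

```python
import bisect

PAGE_CAP = 60  # page size stated in the scrape_job docstring


def fake_search(listings, min_float, cap=PAGE_CAP):
    """Hypothetical stand-in for the search page: `listings` is pre-sorted by
    float; return up to `cap` items with float >= min_float (inclusive)."""
    floats = [it["float"] for it in listings]
    start = bisect.bisect_left(floats, min_float)
    return listings[start:start + cap]


def walk_float_cursor(listings, cap=PAGE_CAP, max_fetches=40):
    """Walk the float axis page by page, deduplicating by id."""
    seen = {}
    lo = 0.0
    fetches = 0
    reason = "completed"
    while fetches < max_fetches:
        page = fake_search(listings, lo, cap)
        fetches += 1
        for it in page:
            seen[it["id"]] = it  # boundary items re-fetched next round collapse here
        if len(page) < cap:
            break  # short page: nothing left above the cursor
        nxt = max(it["float"] for it in page)
        if nxt <= lo:
            reason = "stuck-float-tie"  # a full page shares one float value
            break
        lo = nxt
    else:
        reason = "fetch-cap"  # loop exhausted max_fetches without a break
    return list(seen.values()), fetches, reason


# 150 synthetic listings with distinct floats 0.000 .. 0.149, already sorted.
listings = [{"id": i, "float": i / 1000} for i in range(150)]
items, fetches, reason = walk_float_cursor(listings)
print(len(items), fetches, reason)
```

With distinct floats, each full page overlaps the next by exactly one boundary item, so 150 listings take three fetches. If more than 60 listings share one float value, the cursor cannot move past them and the sketch stops with "stuck-float-tie", the same flagged gap the worker reports.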