Resilience & Troubleshooting¶
PyWuzzuf is designed for reliable data collection in the face of network instability and anti-bot measures. This guide covers how the client handles failures automatically and how you can customize error recovery.
Layered Defense Strategy¶
PyWuzzuf uses a three-layer strategy to ensure reliability:
| Layer | Mechanism | Scope | Automatic? |
|---|---|---|---|
| Transport | Retries & Backoff | Single HTTP Requests | Yes |
| Application | Bot Detection Heuristic | Session State | Yes (Raises Error) |
| Pagination | Flow Control Signals | Data Collection Loop | Opt-in (Callbacks) |
1. Automatic HTTP Retries¶
The client automatically retries failed requests using exponential back-off.
- **Triggers:** Network errors (`RequestsError`) and rate limits (HTTP 429).
- **Strategy:** Exponential back-off with jitter (up to 5 attempts).
- **Configuration:** Currently hardcoded in `HttpCore`. If you need custom retry policies, you can inject a pre-configured `curl_cffi.AsyncSession` with custom adapters.
You generally do not need to handle these errors yourself unless they persist after 5 retries.
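The back-off schedule can be pictured with a small stand-alone sketch. This is not the library's internal code; the base delay and cap below are illustrative assumptions, and only the shape (exponential growth with jitter, capped attempts) reflects the strategy described above:

```python
import random

def backoff_delays(attempts=5, base=1.0, cap=30.0):
    """Exponential back-off with full jitter: the wait window doubles
    each attempt (capped), and a random point in that window is slept."""
    delays = []
    for attempt in range(attempts):
        window = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, 8s, 16s...
        delays.append(random.uniform(0, window))  # full jitter
    return delays

# Each retry waits somewhere in an exponentially growing window:
print([round(d, 2) for d in backoff_delays()])
```

The jitter spreads retries out so that many clients failing at once do not all retry at the same instant.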
2. Bot Detection Mitigation¶
If the Wuzzuf API suspects automated access, it will return HTTP 403 Forbidden. PyWuzzuf tracks consecutive 403 responses.
The Threshold¶
If 3 consecutive 403 responses occur, the client raises a `BotDetectionError` instead of retrying indefinitely.
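The threshold amounts to a consecutive-failure counter that resets on any non-403 response. A minimal sketch, assuming nothing about PyWuzzuf's internals beyond the behavior described above (the tracker class and its names are hypothetical):

```python
class BotDetectionError(Exception):
    pass

class ForbiddenTracker:
    """Raise after N consecutive HTTP 403 responses; reset on success."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_403s = 0

    def record(self, status_code):
        if status_code == 403:
            self.consecutive_403s += 1
            if self.consecutive_403s >= self.threshold:
                raise BotDetectionError(
                    f"{self.consecutive_403s} consecutive 403 responses"
                )
        else:
            self.consecutive_403s = 0  # any success resets the streak

tracker = ForbiddenTracker()
for status in (403, 200, 403, 403):  # the 200 breaks the first streak
    tracker.record(status)
print(tracker.consecutive_403s)  # → 2
```

Because the counter resets on success, an occasional 403 in an otherwise healthy session never trips the error; only a sustained block does.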
Handling the Error¶
```python
import asyncio

from pywuzzuf import WuzzufClient, BotDetectionError

async def main():
    try:
        async with WuzzufClient() as client:  # (1)!
            result = await client.jobs.search("Python").limit(10).all()
            print(f"Found {len(result.items)} jobs.")
    except BotDetectionError as e:  # (2)!
        print(f"Bot detection triggered: {e}")

if __name__ == "__main__":
    asyncio.run(main())
```
- The client uses a default impersonation string (e.g., "chrome124"). If this fingerprint is flagged by Wuzzuf, requests will start failing.
- This exception is raised immediately after 3 consecutive 403s, preventing the script from hanging indefinitely on retries.
Resolution Strategies¶
- **Update Impersonation:** The most common fix is to update the browser fingerprint.

    ```python
    # Try a newer Chrome version or a different browser
    async with WuzzufClient(impersonate="chrome124") as client:  # (1)!
        # ...
        pass
    ```

    - Valid values include `"chrome110"`, `"chrome120"`, `"chrome124"`, `"safari15_5"`, etc. Updating this often bypasses outdated signature blocks.

- **Reduce Request Frequency:** Use `asyncio.sleep` between large pagination calls.
- **Use Proxies:** Pass a proxied `AsyncSession` to the client constructor.
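The frequency-reduction strategy can be as simple as pausing between page requests. A stand-alone sketch using only the standard library (`fetch_page` below is a stand-in for a real paginated call, not a PyWuzzuf API):

```python
import asyncio

async def fetch_page(page):
    # Stand-in for a real paginated request
    return [f"job-{page}-{i}" for i in range(3)]

async def fetch_all(pages, delay=1.0):
    """Fetch pages sequentially, sleeping between requests to stay polite."""
    items = []
    for page in range(pages):
        items.extend(await fetch_page(page))
        if page < pages - 1:
            await asyncio.sleep(delay)  # pause before the next page
    return items

print(len(asyncio.run(fetch_all(3, delay=0.01))))  # → 9
```

Even a one-second pause between pages sharply reduces the request rate a server observes, which is often enough to stay under bot-detection thresholds.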
3. Pagination Flow Control¶
For granular control over how errors are handled during large paginated fetches, use the `on_error` callback. This moves resilience from the "Request" level to the "Dataset" level.
Skipping Failed Pages¶
If you are fetching 1000+ jobs and a single page times out, you might prefer to skip it rather than fail the entire operation.
```python
import asyncio

from pywuzzuf import WuzzufClient, PaginationSignal

async def main():
    def handle_error(error, page, total_fetched):  # (1)!
        print(f"⚠️ Error on page {page}: {error}")
        # If we hit a 404 or Timeout, skip the page
        if "404" in str(error) or "Timeout" in str(error):
            return PaginationSignal.CONTINUE  # (2)!
        # Stop on critical errors
        return PaginationSignal.STOP

    async with WuzzufClient() as client:
        result = await client.jobs.search("Backend") \
            .on_error(handle_error) \
            .limit(500) \
            .all()
        print(f"Fetched {len(result.items)} items.")

if __name__ == "__main__":
    asyncio.run(main())
```
- Context: The callback receives the exception, the page number where it occurred, and the total items fetched so far.
- **Signal:** Returning `CONTINUE` tells the paginator to swallow the error, discard that specific page, and proceed to the next one.
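The control flow these signals imply can be sketched as a plain loop. The enum and `paginate` function below are illustrative, not the library's actual paginator; they only model the skip-versus-abort decision described above:

```python
from enum import Enum

class PaginationSignal(Enum):
    CONTINUE = "continue"
    STOP = "stop"

def paginate(fetch_page, on_error, max_pages):
    """Collect pages, consulting on_error to decide whether to skip or stop."""
    items = []
    for page in range(1, max_pages + 1):
        try:
            items.extend(fetch_page(page))
        except Exception as exc:
            if on_error(exc, page, len(items)) is PaginationSignal.STOP:
                break  # abort the whole fetch, keeping what we have
            # CONTINUE: drop this page's data and move on
    return items

def flaky(page):
    if page == 2:
        raise TimeoutError("Timeout on page 2")
    return [f"item-{page}"]

result = paginate(flaky, lambda e, p, n: PaginationSignal.CONTINUE, max_pages=4)
print(result)  # → ['item-1', 'item-3', 'item-4']
```

Note that `STOP` is not an error: the loop breaks and whatever was fetched before the failure is still returned.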
4. Data Quality Checks¶
Even successful API responses can have missing or malformed fields. The `EnrichedJob` object includes a `quality` attribute to help you audit data.
```python
import asyncio

from pywuzzuf import WuzzufClient

async def main():
    async with WuzzufClient() as client:
        result = await client.jobs.search("Data Scientist").limit(10).all()
        for job in result.items:
            if job.quality.has_anomalies:  # (1)!
                print(f"Job {job.id} has issues:")
                print(f"  Missing: {job.quality.missing_fields}")
                print(f"  Degraded: {job.quality.degraded_fields}")

if __name__ == "__main__":
    asyncio.run(main())
```
- **Aggregated Check:** This property is `True` if any fields are missing (failed enrichment) or degraded (ambiguous values).
This ensures your pipeline doesn't break silently when the API returns incomplete records.
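The aggregation behind `has_anomalies` can be pictured as a simple dataclass. The field names mirror the attributes shown above, but the class itself is an illustrative sketch, not the library's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class QualityReport:
    missing_fields: list = field(default_factory=list)   # enrichment failed entirely
    degraded_fields: list = field(default_factory=list)  # value present but ambiguous

    @property
    def has_anomalies(self) -> bool:
        # True if anything is missing OR degraded
        return bool(self.missing_fields or self.degraded_fields)

clean = QualityReport()
suspect = QualityReport(missing_fields=["salary"], degraded_fields=["location"])
print(clean.has_anomalies, suspect.has_anomalies)  # → False True
```

A single boolean gate like this lets a pipeline route suspect records to a review queue while clean ones flow straight through.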
Next Steps¶
You have now covered the full lifecycle of the client. Refer to the API Reference for detailed type information.