Advanced Pagination¶
PyWuzzuf abstracts away the complexities of cursor management, rate-limit recovery, and state tracking. You can choose between "all-at-once" collection for convenience or item-by-item streaming for memory efficiency.
Two Modes of Operation¶
| Mode | Method | Use Case | Memory Usage |
|---|---|---|---|
| Eager | `.all()` | Small/medium datasets (<10k jobs). Data analysis. | High (buffers all items in RAM). |
| Streaming | `async for` | Large datasets. Exporting to DB/CSV. | Constant (one page at a time). |
Eager Collection: .all()¶
The simplest way to fetch results. Returns a `PaginationResult` object containing the fetched items along with metadata about the run (pages fetched, early termination, and so on).
```python
import asyncio

from pywuzzuf import WuzzufClient

async def main():
    async with WuzzufClient() as client:
        # Fetch up to 100 jobs
        result = await client.jobs.search("Data Science").limit(100).all()

        print(f"Fetched: {len(result.items)}")
        print(f"Pages: {result.pages_fetched}")
        print(f"Stopped Early: {result.terminated_early}")

if __name__ == "__main__":
    asyncio.run(main())
```
Setting Hard Caps¶
To prevent runaway queries, always set a limit or page cap when using .all().
```python
import asyncio

from pywuzzuf import WuzzufClient

async def main():
    async with WuzzufClient() as client:
        # Parentheses allow the chain to span lines; a trailing backslash
        # followed by a comment would be a syntax error.
        result = await (
            client.jobs.search("Python")
            .limit(500)  # (1)!
            .max_pages(20)  # (2)!
            .all()
        )
        print(f"Total jobs fetched: {len(result.items)}")

if __name__ == "__main__":
    asyncio.run(main())
```
- Item Cap: Stop fetching after 500 items are collected.
- Page Cap: Stop after 20 pages, even if fewer than 500 items were found. Useful as a safety net.
Memory Safety
If you call `.all()` without a limit, PyWuzzuf will attempt to fetch every matching job. For popular searches, this can exhaust available memory and crash your process. A warning is logged whenever `.all()` is called without limits.
Streaming: async for¶
For large datasets, use the asynchronous iterator. This processes items as they arrive, keeping only one page in memory at a time.
```python
import asyncio

from pywuzzuf import WuzzufClient

async def main():
    async with WuzzufClient() as client:
        async for job in client.jobs.search("DevOps").paginate():
            # Process one job at a time
            print(f"Processing: {job.attributes.title}")

if __name__ == "__main__":
    asyncio.run(main())
```
This is ideal for:
- Exporting data to files or databases.
- Running in memory-constrained environments.
- Processing "infinite" streams until a condition is met.
Flow Control & Signals¶
You can intercept the pagination process to handle errors or stop execution using callbacks.
Error Handling: on_error¶
If a page fetch fails (e.g., network timeout), the default behavior is to raise an exception. You can change this by returning a PaginationSignal.
```python
import asyncio

from pywuzzuf import WuzzufClient, PaginationSignal

async def main():
    def handle_error(error, page, total_fetched):
        print(f"Error on page {page}: {error}")
        # If it's a 404, skip the page. Otherwise, stop.
        if "404" in str(error):
            return PaginationSignal.CONTINUE  # (1)!
        return PaginationSignal.STOP  # (2)!

    async with WuzzufClient() as client:
        result = await (
            client.jobs.search("Backend")
            .on_error(handle_error)
            .all()
        )
        print(f"Finished with {len(result.items)} jobs.")

if __name__ == "__main__":
    asyncio.run(main())
```
- CONTINUE: Swallow the error, skip the failed page, and proceed to the next one.
- STOP: Gracefully stop pagination and return the items collected so far.
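To make these semantics concrete, here is a simplified, synchronous sketch of how a pagination loop might consult the error callback. This is an illustration only, not PyWuzzuf's actual implementation; `fetch_page` and `on_error` are hypothetical stand-ins:

```python
import enum

class PaginationSignal(enum.Enum):
    CONTINUE = "continue"
    STOP = "stop"

def paginate(fetch_page, on_error, max_pages=10):
    """Collect items page by page, consulting on_error when a fetch fails."""
    items = []
    for page in range(1, max_pages + 1):
        try:
            batch = fetch_page(page)
        except Exception as error:
            signal = on_error(error, page, len(items))
            if signal is PaginationSignal.STOP:
                break      # STOP: return what was collected so far
            continue       # CONTINUE: skip the failed page, try the next
        if not batch:
            break          # no more results
        items.extend(batch)
    return items
```

With a `CONTINUE` handler, one failed page leaves a gap but the loop keeps going; with a `STOP` handler, the items gathered before the failure are still returned rather than lost to an exception.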
Manual Stopping: on_progress¶
You can also use the progress callback to stop iteration based on custom logic (e.g., "Stop if I found a job from company X").
```python
import asyncio

from pywuzzuf import WuzzufClient, PaginationSignal

async def main():
    def monitor_progress(total, page):  # (1)!
        print(f"Fetched {total} jobs (Page {page})...")
        # Custom stop condition
        if total > 1000:
            print("Limit reached, stopping.")
            return PaginationSignal.STOP  # (2)!

    async with WuzzufClient() as client:
        query = client.jobs.search("Sales").on_progress(monitor_progress)
        async for job in query.paginate():  # (3)!
            print(f"Job: {job.attributes.title}")

if __name__ == "__main__":
    asyncio.run(main())
```
- Arguments: The callback receives `total` (items fetched so far) and `page` (current page number, 1-indexed).
- Signal Return: Returning `STOP` breaks the `async for` loop immediately. The generator will not yield any more items from that point.
- Streaming Mode: Unlike `.all()`, using `.paginate()` allows you to process items one by one in the loop body while the callback monitors the overall progress in the background.
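In streaming mode you can also stop from the loop body itself with a plain `break`, since exiting an `async for` loop simply stops consuming the generator. A minimal, library-free illustration, where the `pages()` generator is a hypothetical stand-in for `.paginate()`:

```python
import asyncio

async def pages():
    # Stand-in for .paginate(): yields jobs indefinitely.
    i = 0
    while True:
        yield f"job-{i}"
        i += 1

async def first_n(n):
    seen = []
    async for job in pages():
        seen.append(job)
        if len(seen) >= n:
            break  # stop consuming; no further pages would be fetched
    return seen

if __name__ == "__main__":
    print(asyncio.run(first_n(3)))
```

A `break` is the simplest tool when the stop condition depends on an individual item; the `on_progress` callback is better suited to aggregate conditions like total counts or page limits.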
Next Steps¶
Learn how to handle the inevitable network hiccups and API blocks you might encounter during pagination.