---

Pyper is a comprehensive framework for concurrent and parallel data-processing, based on functional programming patterns. Used for 🌐 **Data Collection**, 🔀 **ETL Systems**, and general-purpose 🛠️ **Python Scripting**

See the [Documentation](https://pyper-dev.github.io/pyper/)

Key features:

* 💡 **Intuitive API**: Easy to learn, easy to think about. Implements clean abstractions to seamlessly unify threaded, multiprocessed, and asynchronous work.
* 🚀 **Functional Paradigm**: Python functions are the building blocks of data pipelines. Lets you write clean, reusable code naturally.
* 🛡️ **Safety**: Hides the heavy lifting of underlying task execution and resource clean-up. No more worrying about race conditions, memory leaks, or thread-level error handling.
* ⚡ **Efficiency**: Designed from the ground up for lazy execution, using queues, workers, and generators.
* ✨ **Pure Python**: Lightweight, with zero sub-dependencies.
Note that `python-pyper` is the [pypi](https://pypi.org/project/python-pyper) registered package.
## Usage

In Pyper, the `task` decorator is used to transform functions into composable pipelines.

Let's simulate a pipeline that performs a series of transformations on some data.

```python
import asyncio
import time

from pyper import task


def get_data(limit: int):
    for i in range(limit):
        yield i


async def step1(data: int):
    await asyncio.sleep(1)
    print("Finished async wait", data)
    return data


def step2(data: int):
    time.sleep(1)
    print("Finished sync wait", data)
    return data


def step3(data: int):
    for i in range(10_000_000):
        _ = i * i
    print("Finished heavy computation", data)
    return data


async def main():
    # Define a pipeline of tasks using `pyper.task`
    pipeline = task(get_data, branch=True) \
        | task(step1, workers=20) \
        | task(step2, workers=20) \
        | task(step3, workers=20, multiprocess=True)

    # Consume the pipeline's outputs
    total = 0
    async for output in pipeline(limit=20):
        total += output
    print("Total:", total)


if __name__ == "__main__":
    asyncio.run(main())
```
<details markdown="1">
<summary><u>See explanation</u></summary>

<br>

Pyper provides an elegant abstraction of the execution of each function via `pyper.task`, allowing you to focus on building out the **logical** functions of your program. In the `main` function:

* `pipeline` defines a function; this takes the parameters of its first task (`get_data`) and yields each output from its last task (`step3`)
* Tasks are piped together using the `|` operator (motivated by Unix's pipe operator) as a syntactic representation of passing inputs/outputs between tasks

In the pipeline, we are executing three different types of work:

* `task(step1, workers=20)` spins up 20 `asyncio.Task`s to handle asynchronous IO-bound work
* `task(step2, workers=20)` spins up 20 `threads` to handle synchronous IO-bound work
* `task(step3, workers=20, multiprocess=True)` spins up 20 `processes` to handle synchronous CPU-bound work

`task` acts as one intuitive API for unifying the execution of each different type of function.

Each task submits its outputs to the next task within the pipeline via queue-based data structures, which is the mechanism underpinning how concurrency and parallelism are achieved. See the [docs](https://pyper-dev.github.io/pyper/docs/UserGuide/BasicConcepts) for a breakdown of what a pipeline looks like under the hood.

---

</details>
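The queue-based hand-off between stages can be pictured with a minimal, standard-library-only sketch. This is an illustration of the general producer/worker pattern, not Pyper's actual implementation; `get_data`, `double`, and `run_pipeline` are hypothetical stand-ins:

```python
import queue
import threading

# Simplified sketch of a two-stage pipeline: a producer feeds an input
# queue, worker threads transform items and feed an output queue, and
# the consumer yields results lazily. (Illustrative only -- Pyper also
# handles errors, shutdown, and async/multiprocess workers.)

def get_data(limit):
    for i in range(limit):
        yield i

def double(x):
    return 2 * x

def run_pipeline(limit, workers=4):
    q_in = queue.Queue()
    q_out = queue.Queue()

    def producer():
        for item in get_data(limit):
            q_in.put(item)
        for _ in range(workers):
            q_in.put(None)  # one sentinel per worker signals end of input

    def worker():
        while (item := q_in.get()) is not None:
            q_out.put(double(item))
        q_out.put(None)  # signal this worker is done

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()

    finished = 0
    while finished < workers:
        item = q_out.get()
        if item is None:
            finished += 1
        else:
            yield item  # lazily hand results to the caller
    for t in threads:
        t.join()

print(sorted(run_pipeline(5)))  # -> [0, 2, 4, 6, 8]
```

Workers finish in nondeterministic order, which is why the output is sorted before printing.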
<details markdown="1">
<summary><u>See a non-async example</u></summary>

<br>

Pyper pipelines are by default non-async, as long as their tasks are defined as synchronous functions. For example:

```python
import time

from pyper import task


def get_data(limit: int):
    for i in range(limit):
        yield i


def step1(data: int):
    time.sleep(1)
    print("Finished sync wait", data)
    return data


def step2(data: int):
    for i in range(10_000_000):
        _ = i * i
    print("Finished heavy computation", data)
    return data


def main():
    pipeline = task(get_data, branch=True) \
        | task(step1, workers=20) \
        | task(step2, workers=20, multiprocess=True)
    total = 0
    for output in pipeline(limit=20):
        total += output
    print("Total:", total)


if __name__ == "__main__":
    main()
```

A pipeline consisting of _at least one asynchronous function_ becomes an `AsyncPipeline`, which exposes the same usage API, provided `async` and `await` syntax in the obvious places. This makes it effortless to combine synchronously defined and asynchronously defined functions where need be.

</details>
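The sync/async unification that an `AsyncPipeline` provides can be illustrated with plain `asyncio`. This is a sketch of the underlying technique (offloading synchronous steps to worker threads so they don't block the event loop), not Pyper's API; `async_step`, `sync_step`, and `main` are hypothetical stand-ins:

```python
import asyncio
import time


async def async_step(x: int) -> int:
    # Asynchronous IO-bound work runs directly on the event loop
    await asyncio.sleep(0.01)
    return x + 1


def sync_step(x: int) -> int:
    # Synchronous IO-bound work must be offloaded to a thread,
    # otherwise it would block the event loop
    time.sleep(0.01)
    return x * 2


async def main() -> list:
    results = []
    for x in range(3):
        y = await async_step(x)                    # awaited on the loop
        z = await asyncio.to_thread(sync_step, y)  # run in a worker thread
        results.append(z)
    return results


if __name__ == "__main__":
    print(asyncio.run(main()))  # -> [2, 4, 6]
```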
To explore more of Pyper's features, see some further examples in the [Documentation](https://pyper-dev.github.io/pyper/)
## Dependencies

Pyper is implemented in pure Python, with no sub-dependencies. It is built on top of the well-established built-in Python modules:

* [threading](https://docs.python.org/3/library/threading.html) for thread-based concurrency
* [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) for parallelism
* [asyncio](https://docs.python.org/3/library/asyncio.html) for async-based concurrency
* [concurrent.futures](https://docs.python.org/3/library/concurrent.futures.html) for unifying threads, processes, and async code
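As a small illustration of why `concurrent.futures` is a natural unifying layer: thread pools and process pools expose the same executor interface, so the same `map`/`submit` code works for both kinds of workers. `io_bound` and `cpu_bound` below are hypothetical stand-in functions:

```python
import math
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def io_bound(n: int) -> int:
    return n * 2  # stand-in for e.g. a blocking network call


def cpu_bound(n: int) -> int:
    return math.factorial(n) % 97  # stand-in for heavy computation


if __name__ == "__main__":
    # The same map API works for both executor types
    with ThreadPoolExecutor(max_workers=4) as tpe:
        print(list(tpe.map(io_bound, range(5))))   # -> [0, 2, 4, 6, 8]
    with ProcessPoolExecutor(max_workers=2) as ppe:
        print(list(ppe.map(cpu_bound, range(5))))  # -> [1, 1, 2, 6, 24]
```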