This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
Just going through technical documentation, it seems like we are going to have another 'scale out, single threaded' architecture. This seems very legacy to me as its very similar to hadoop architecture aka add more nodes for workloads aka customer spends more money.
We need to make it future proof/scalable proof, have concurrency and parallelism built inside the software so it takes care of muli-thread operations rather than just single threads. We should also give them the opportunity to spin up a worker node for projects in the cloud (AWS, GCP, Azure) with any size and run the workloads there.
Apologies if the architectural documentation does a poor job explaining, but I was aiming to address the high-level structure and not necessarily the details of process threading or similar.
The execution layer that is running under Predictive Server has two big components - the JVM coordination layer, and individual python tasks. The JVM layer is properly multithreaded, and in many areas is operating as an event-based architecture.
The python layer is partially single-threaded at the moment. This is a limitation of python the language. There are components of the work that we do where it is not single-threaded - some types of scoring pipelines will utilize multiple cores by calling out to libraries that are implemented natively. However, this is only single thread per-user, per-task. We can and do run multiple python processes concurrently on the same hardware.
When we start scaling out the python subcomponents of predictive server, it will likely go in this order:
1. see where we can get to by vertically scaling the system & increasing the number of allocated executors running on each node. This already works today - you can run more than one python process on a box at a time. 2. scale out slow single-process python tasks to use multiple python processes, coordinated by the scala layer. This would more effectively provide multithreading for single tasks, but isn't always worth the coordination overhead. 3. add multi-node data support. This would likely require the utilization of a proper distributed compute mechanism such as Dask or Spark, and swapping out the current python dataframes for ones backed by the Koalas library or similar.
Feel free to reach out if you want more information.