A long time ago (in a galaxy far away?) we created multithreading and reaped the benefits of logical parallelism. Later, with multiple cores we added benefits from physical parallelism. But, for anyone who is written multithreaded and multicore programs the expected benefits are usually far greater than the reality. It turns out that on the front side there's a cost to breaking a program into pieces that can run independently (on threads or cores) and on the backside there is a cost to reassembling and re-synchronizing the partial solutions into a cogent whole. If the requisite partitioning and coordinated reassembly is a single thread of procedural code then that process will be the limiting factor defining the maximum possible speed up. If the administrative management of input and output data is 20% of the cycles used then no matter how fast you make the operations on the partitioned data the complete process can never be faster than five times the single threaded version. The administrative code dominates the latency of the overall program.
Of course, whenever we start to parallelize a program the first thing we look at is this administrative overhead. The efficiency of this administrative component is far more important than the efficiency of the parallel components because if you just throw in a few more cores the overall program will have less latency. Not the most elegant solution, but it is a practical, initial engineering approach.
One interesting thing about the problem space that we try to attack with parallel programming is that it is often inherently graph like in its structure. Imagine something like a weather simulation in which the region is broken down into cells which are recursively broken into ever smaller cells. The atmosphere is a volume and the natural way to divide and conquer is to use an octree graph: each queue of the volume is divided into eight smaller cubes. The tree structure defines how the volume is disassembled and reassembled and the depth of each node in the tree defines the order "in time" that must be respected for computation and reassembly. Essentially, framing the problem in a graph like context should allow the problem to be broken into many small pieces and associate an order based priority on when those pieces can be processed. So far there's nothing magical in this sort of approach, it is still a standard hierarchical decomposition with parent node bottlenecks all the way up.
Before I launch into the new paradigm we should revisit what the modern-day CPU has become. In the early days of the general purpose computer the CPU pretty much would only fetch data from memory and do some simple operation and write the result in memory. Today CPU chips have many auxiliary processes that we tend to forget about. For instance, memory is actively managed as it is moved into L1 and L2 memory caches, the CPU on its own figures out how to keep the the data it uses in the most speed efficient locations. Another example is that if the CPU has some spare cycles then multiple logical branches in the code are processed until it's clear which branch will be taken, then the unneeded branches simply discarded: the time used on the branch not taken does not appear in the overall code latency. These specialized hardware sections of the modern CPU are essentially mini parallelization events that reduce the process latency and which most of us never even think about (and many aren't even aware of).
But back to our graph like decomposition of the problem. What if another hardware acceleration feature was added to our CPU? What if each of the nodes of our graphical decomposition contained not only its relevant local data but also its "priority" in the scheme of things. Then, deep inside the CPU some autonomic circuitry could decide entirely on its own when to operate on the data and when and how to rejoin it to its parent? In that scenario it may be possible to release a "swarm" of very small lightweight processes that each operate within their small, individual subtask domain. Each swarm member could figure out the what, where, and when for its particular contribution toward the solution. This approach to the problem will lead to overall faster processing and more efficient parallelization. But, the biggest return on this approach is that it should dramatically reduce the time and lines of code necessary to write efficient parallel programs. Just as you would never dream of writing your own L1-L2 cash management software, we can hope that someday you won't torment over decomposing and reassembling your data so you can parallelize it.
This technique could also be applied to problems that are as sensitive to the structure of the graph but rather are sensitive to the weights that relate to nodes (e.g. neural nets). This would allow the CPU to decide on its own to work on the most important part of the problem (the heaviest weights) first. Much like the progressive loading of images on a webpage you can imagine a system like this giving you a quick "pretty good" solution, and if you give it a little more time the solution gets better.
This is not fanciful thought or science-fiction. You should follow this link to an actual project underway at MIT. The MIT article contains lots of interesting details and a relatively easy read. Among the things you'll learn are that experiments show speedups between 3X and 18X (in one case 75X!) speedups over programs that were parallelized by hand by experienced scientific programmers. And because this new CPU architecture manages so much of the complex disassembly, reassembly, synchronization these new swarm programs required about 1/10 the code.