We are facing an issue where resource utilization is skewed across subtasks, even though there is no data skew in our input data.
We have 8 data sources joined together with left joins. The left table contains 1.6B records; one of the right tables is about 1B, and the others are about 0.1B each. The join is expressed in SQL, and it produces a query plan with leftInputSpec = HasUniqKey and rightInputSpec = JoinKeyContainsUniqueKey all the way down. There is not much difference in cardinality across the join keys: by our statistics, the top-100 cardinality skew is in the range [1:65K, 1:17K], which is insignificant among 1.6B records.
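As an aside on why uniform key frequencies do not guarantee uniform subtask load: here is a minimal Python sketch (not Flink code; the `key_group` / `subtask_index` helpers are hypothetical, and Python's built-in `hash` stands in for Flink's murmur hash) of how a hash shuffle assigns keys to parallel subtasks. The per-subtask load depends entirely on which key groups happen to land on which subtask.

```python
from collections import Counter

# Hypothetical stand-in for Flink-style key-group assignment.
# Python's hash() replaces the murmur hash Flink applies to key.hashCode().
def key_group(key, max_parallelism=128):
    return hash(key) % max_parallelism

def subtask_index(key, parallelism, max_parallelism=128):
    # A key group is mapped to an operator subtask proportionally.
    return key_group(key, max_parallelism) * parallelism // max_parallelism

# Even with perfectly uniform key frequencies (each key appears once),
# the records-per-subtask distribution is decided by the hash mapping alone.
keys = [f"user_{i}" for i in range(10_000)]
load = Counter(subtask_index(k, parallelism=8) for k in keys)
print(sorted(load.items()))
```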
JOIN is CPU intensive, and the CPU is saturated for roughly the first hour before dropping to ~50%, even though the data in the Kafka sources is far from fully consumed. Disk and network utilization also drop during that time. Since JOIN is computationally intensive, it should use as many resources as it can, but that is not happening in our case: a lot of resources sit idle without any apparent bottleneck.
Our backend is RocksDB, with checkpointing effectively disabled by setting the checkpoint interval to infinity. The hosts are 32 x i3.2xlarge instances from AWS, each with 8 vCPUs, 61GB memory, and a 1.7TB NVMe SSD.
The following are some related metrics. My questions are as follows:
1. Does the one busy subtask cause back pressure for all subtasks upstream, or only some of them?
2. If it only causes the back pressure of one/some of them, why can't others make progress?
3. Is there any hint on locating the bottleneck and overall optimization for this job?
Figure-1. Subtask resource utilization skew
Figure-2. No data skew for incoming records
Figure-3. CPU utilization pattern of task manager
Figure-4. Kafka source lag max and task MGR CPU utilization
> 1. Does the one busy subtask cause back pressure for all subtasks upstream, or only some of them?

In fact, one busy subtask only causes backpressure for those upstream subtasks that need to send data to it. In your case, the upstream subtasks that hash-shuffle data to subtasks 248 and 251 would be backpressured. You can double-check by clicking the back pressure tab of the adjacent upstream operators.
> If it only causes the back pressure of one/some of them, why can't others make progress?
In fact, those subtasks of the adjacent upstream node that do not need to send data to the hot downstream subtask can continue working. However, backpressure propagates through the whole job pipeline, all the way up to the sources. Besides, things are worse in your case because there are multiple hash joins.
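The propagation described above can be illustrated with a toy model, assuming bounded buffers between operators (a simplification of Flink's network buffers, not actual Flink code): once the buffer in front of the busy operator fills, the sender stalls, its own buffer fills, and the stall reaches the source.

```python
from collections import deque

class Operator:
    """Toy operator with a bounded input buffer (a stand-in for Flink's
    network buffers). When the buffer is full, the upstream sender must
    wait, which is how backpressure propagates hop by hop to the source."""
    def __init__(self, name, capacity=2):
        self.name = name
        self.buffer = deque()
        self.capacity = capacity

    def can_accept(self):
        return len(self.buffer) < self.capacity

    def push(self, record):
        assert self.can_accept()
        self.buffer.append(record)

# A pipeline: source -> join -> sink, where the sink is "busy" and never
# drains. The full sink buffer stalls the join; the full join buffer then
# stalls the source, so the whole pipeline stops making progress.
join, sink = Operator("join"), Operator("sink")
emitted = 0
for record in range(100):
    if not join.can_accept():      # source is now backpressured
        break
    join.push(record)
    emitted += 1
    if sink.can_accept():          # join forwards, but sink never drains
        sink.push(join.buffer.popleft())
print(emitted)  # only a handful of records flow before the source stalls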
> Is there any hint on locating the bottleneck and overall optimization for this Job?
There is a very useful debugging tool in 1.13.0 which can help you find out why subtasks 248 and 251 are so busy: CPU flame graphs are supported in the Web UI since 1.13.0; see the Flink documentation.
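For reference, enabling the feature should be a one-line config change (assuming Flink 1.13+; the flag below is the documented option name):

```yaml
# flink-conf.yaml (Flink 1.13+): enable the experimental flame graph view
rest.flamegraph.enabled: true
```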
Thank you for the prompt reply.
1. Indeed, it shows that the specific subtask is backpressured in the corresponding upstream subtasks;
2. The explanation makes sense that a partition can be shared by subtasks, and that this is amplified by the multiple hash join phases;
3. I did the profiling with the flame graph feature, and another round of profiling with async-profiler for that specific thread. The hot spot is in RocksDB's seek, due to HasUniqKey introducing suboptimal matching, which is costly for that thread.
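To illustrate why a seek-and-scan is more expensive than a point lookup, here is a toy Python model of the two state layouts. This is a sketch of the idea only; the composite-key layout and the `matches_for` helper are assumptions for illustration, not Flink's actual state code.

```python
import bisect

# Toy model of join state layout. With a truly unique key, one exact
# lookup suffices (dict.get, analogous to a RocksDB point get). With a
# non-unique join key, matches are stored under a (joinKey, uniqueKey)
# composite key, and the operator must seek to the prefix and scan.
unique_state = {("k1", 1): "row-a"}
row = unique_state.get(("k1", 1))  # single exact lookup, no scan

composite_state = sorted([("k1", 1), ("k1", 2), ("k2", 3)])

def matches_for(join_key):
    # Seek to the first entry >= the join-key prefix, then iterate
    # forward while the prefix still matches (analogous to iterator
    # Seek + Next in RocksDB).
    i = bisect.bisect_left(composite_state, (join_key,))
    out = []
    while i < len(composite_state) and composite_state[i][0] == join_key:
        out.append(composite_state[i])
        i += 1
    return out

print(matches_for("k1"))  # [('k1', 1), ('k1', 2)]
```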
Please see attachment for the profiling file and advise.
Attachment: 26001.html (532K)
I agree with your analysis that the hot spot is RocksDB's seek, because the left side is HasUniqKey. It happens when the join operator processes input data from the right side.
I suspect the reason only subtasks 248 and 251 are hot is that the right side is skewed. However, the right side is much smaller than the left side, so the skew does not show up in the incoming-records view of the Web UI.
You could double-check whether this is the root cause in one of the following ways:
1. profile another subtask and compare the result with the profile of the hot subtask;
2. if possible, block the left-side stream for a while and observe the incoming records in the Web UI.
I hope it helps.
Kai Fu <[hidden email]> wrote on Wed, Jun 2, 2021 at 1:50 PM:
Thank you for the suggestion. We optimized the job by restructuring the joins into a bushy-join-like shape to shorten the pipeline, and by grouping perfect-match join conditions together as early as possible. Now we can see that the CPU is saturated during the cold-start phase, so the problem is solved for the moment. I believe we still need a deeper dive into the subtask-skew issue, given there is no apparent upstream data skew. Thank you for the help.
On Wed, Jun 2, 2021 at 6:55 PM JING ZHANG <[hidden email]> wrote: