
A question about how the parallel command works

There is a passage in the official documentation for the parallel tool that I don't quite understand:
For better parallelism GNU parallel can distribute the arguments between all the parallel jobs when end of file is met.

Below GNU parallel reads the last argument when generating the second job. When GNU parallel reads the last argument, it spreads all the arguments for the second job over 4 jobs instead, as 4 parallel jobs are requested.

The first job will be the same as the --xargs example above, but the second job will be split into 4 evenly sized jobs, resulting in a total of 5 jobs:

cat num30000 | parallel --jobs 4 -m echo | wc -l
Output (if you run this under Bash on GNU/Linux):

5
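The text above clearly says the work is split into 4 jobs, so why is the result 5 lines?
Second, does the description above mean that parallel first reads the whole file and then hands its contents out as arguments to the jobs? If the file is large, wouldn't reading it all before distributing be slow? For example, when counting the lines of a very large file, reading the whole file first and then distributing the work (which is only counting lines) for parallel processing should take longer than simply running wc -l, shouldn't it?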

Even stranger, on my machine the result is 6 lines?

[10:01 sxuan@hulab ~]$ cat num30000 | parallel --jobs 4 -m echo | wc -l
6

Thanks!

Answer
澐染

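-m packs multiple input lines into the argument list of a single command, and the length of a command line is limited, so the input may be split into more than 4 jobs in total. --jobs 4 only limits how many jobs run at the same time. Each echo invocation prints one line, so wc -l is counting the number of jobs.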

> seq 1 30000 | parallel --jobs 4 -m echo | wc -l
5
> seq 1 100000 | parallel --jobs 4 -m echo | wc -l
8
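
The packing effect can be reproduced without parallel. The following is a minimal sketch that uses xargs -s as a stand-in for the command-line length limit; the helper name count_command_lines and the limits 100000 and 20000 are illustrative assumptions, and parallel's own limit and splitting strategy may differ from what xargs does.

```shell
# Pack the numbers 1..N into echo command lines of at most LIMIT bytes each
# and print how many command lines were needed.
# (count_command_lines is a hypothetical helper, not part of parallel or xargs.)
count_command_lines() {
  n=$1
  limit=$2
  seq 1 "$n" | xargs -s "$limit" echo | wc -l | tr -d ' '
}

# Size of the input: the numbers 1..30000, one per line, take 168894 bytes.
seq 1 30000 | wc -c

# The same input needs more command lines when the limit is smaller.
count_command_lines 30000 100000
count_command_lines 30000 20000
```

Every echo invocation produces exactly one output line, which is why counting lines with wc -l is a convenient way to count how many command lines were generated.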

You can see the argument length limits with xargs --show-limits:

> xargs --show-limits
Your environment variables take up 892 bytes
POSIX upper limit on argument length (this system): 2094212
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2093320
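
If xargs --show-limits is not available (it is a GNU xargs option), a rough equivalent can be sketched with getconf. The variable names below are illustrative, and the result is only an approximation: the kernel also counts pointers and terminators, and tools such as xargs reserve additional headroom.

```shell
# System-wide upper bound on the combined size of arguments and environment.
arg_max=$(getconf ARG_MAX)

# Approximate size of the current environment in bytes.
env_bytes=$(env | wc -c | tr -d ' ')

echo "ARG_MAX: $arg_max"
echo "environment: $env_bytes bytes"
echo "approximate room left for arguments: $((arg_max - env_bytes))"
```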
June 23, 2018, 19:02