公司的产品一直紧跟 .net core 3.0 preview 不断升级, 部署到 linux 服务器后, 偶尔会出现某个进程cpu占用100%.
  由于服务部署在云上, 不能使用远程调试; 在局域网内的linux 服务器 或 windows开发机上又不能重现这个问题, 联想到java的jstack, 很是羡慕啊. 想到.net core 已经出来这么久了, 还是试着找找看吧, 结果还真找到一篇博客introducing diagnostics improvements in .net core 3.0

 

  这篇文章介绍了3个工具

  • dotnet-counters: 实时统计runtime的状况, 包括 cpu、内存、gc、异常等
  • dotnet-trace: 类似性能探测器
  • dotnet-dump: 程序崩溃时使用该工具

  这次使用的是dotnet-dump, 即使程序没有崩溃, 也可以dump程序快照, 用于分析

 

实验环境

ubuntu-16.04.5-desktop-amd64
sdk 3.0.100-preview6-012264

 

1. 新建一个简单console程序(只能是 .net core 3.0的程序, 不支持 .net core 2.2), 模拟cpu占用100%的情况

mkdir netcoredumptest && cd netcoredumptest
dotnet new console

编辑program.cs

namespace netcoredumptest
{
    using system;
    using system.threading.tasks;
    class program
    {
        static void main(string[] args)
        {
            task.factory.startnew(() => printnumber("print", 5));
            console.writeline("press any key to exit.");
            console.readkey();
        }

        static void printnumber(string message, int startnumber)
        {
            var number = startnumber;
            while (true)
                console.writeline($"{message} {number++}");
        }
    }
}

 

2. 安装dotnet-dump

dotnet tool install --global dotnet-dump --version 1.0.4-preview6.19311.1

提示

if you are using bash, you can add it to your profile by running the following command:

cat << \eof >> ~/.bash_profile
# add .net core sdk tools
export path="$path:/home/****/.dotnet/tools"
eof

you can add it to the current session by running the following command:

export path="$path:/home/****/.dotnet/tools"

you can invoke the tool using the following command: dotnet-dump
tool 'dotnet-dump' (version '1.0.4-preview6.19311.1') was successfully installed.

建议将 $home/.dotnet/tools加入到path, 好吧, 照着做吧, 记得使用下面的命令使设置立即生效

source ~/.bash_profile

 

3. 使用 dotnet netcoredumptest.dll 启动我们的问题程序, 然后使用  ps -ef | grep dotnet  查看程序的进程id, 可以看到进程id是 3411

ps -ef | grep dotnet
z*****e 3411 1464 22 07:51 pts/8 00:00:59 dotnet netcoredumptest.dll z*****e 3431 2935 0 07:55 pts/9 00:00:00 grep --color=auto dotnet

针对进程3411, 我们还需要知道是哪个线程占cpu, 使用 top -hp 3411 可以列出所有线程, 由于top每隔3秒刷新一次, 所以可能需要多观察几秒才能看到具体是哪个线程占用cpu比较高, 这里我们可以看到是pid=3418的线程(linux的进程id和线程id请自行了解一下).

top -hp 3411
   pid user      pr  ni    virt    res    shr s %cpu %mem     time+ command
  3418 z*****e    20   0 2997700  29060  22400 r 10.3  1.4   0:20.68 dotnet
  3411 z*****e    20   0 2997700  29060  22400 s  0.0  1.4   0:00.11 dotnet
  3412 z*****e    20   0 2997700  29060  22400 s  0.0  1.4   0:00.02 dotnet
  3413 z*****e    20   0 2997700  29060  22400 s  0.0  1.4   0:00.00 dotnet
  3414 z*****e    20   0 2997700  29060  22400 s  0.0  1.4   0:00.00 dotnet
  3415 z*****e    20   0 2997700  29060  22400 s  0.0  1.4   0:00.01 dotnet
  3416 z*****e    20   0 2997700  29060  22400 s  0.0  1.4   0:00.00 dotnet
  3417 z*****e    20   0 2997700  29060  22400 s  0.0  1.4   0:00.00 dotnet
  3421 z*****e    20   0 2997700  29060  22400 s  0.0  1.4   0:00.00 dotnet

 

获取dump, 只能正对进程进行dump, 所以我们输入的是 3411

dotnet-dump collect -p 3411
writing minidump with heap to /tmp/core_20190623_075649
complete

 

4. 分析

dotnet-dump analyze core_20190623_075649

使用clrthreads 查看所有线程

>clrthreads
threadcount:      4
unstartedthread:  0
backgroundthread: 3
pendingthread:    0
deadthread:       0
hosted runtime:   no
                                                                                                        lock
 dbg   id osid threadobj           state gc mode     gc alloc context                  domain           count apt exception
   0    1  d53 0000000001307d80    20020 preemptive  0000000000000000:0000000000000000 0000000001306450 1     ukn
   4    2  d57 000000000135bbd0    21220 preemptive  0000000000000000:0000000000000000 0000000001306450 0     ukn (finalizer)
   6    3  d59 00007f666c0009f0  1020220 preemptive  0000000000000000:0000000000000000 0000000001306450 0     ukn (threadpool worker)
   7    4  d5a 000000000130da40  1021220 preemptive  00007f6678106860:00007f6678106f20 0000000001306450 1     ukn (threadpool worker)

我们关心的线程3418的16进制是d5a, 也就是最后一行, 它的dbg是7, 我们需要使用 setthread 7, 将其设置为  当前操作的线程

然后使用 clrstack 获取线程调用信息

> setthread 7
> clrstack
os thread id: 0xd5a (7)
        child sp               ip call site
00007f6715561558 00007f671a2bd4bd [inlinedcallframe: 00007f6715561558] interop+sys.write(system.runtime.interopservices.safehandle, byte*, int32)
00007f6715561558 00007f669f669a9e [inlinedcallframe: 00007f6715561558] interop+sys.write(system.runtime.interopservices.safehandle, byte*, int32)
00007f6715561540 00007f669f669a9e ilstubclass.il_stub_pinvoke
00007f67155615e0 00007f669f67333e system.consolepal.write(microsoft.win32.safehandles.safefilehandle, byte*, int32, boolean)
00007f67155616a0 00007f669f67360c system.consolepal.write(microsoft.win32.safehandles.safefilehandle, byte[], int32, int32, boolean) [/_/src/system.console/src/system/consolepal.unix.cs @ 1236]
00007f67155616c0 00007f669f672b2a system.io.streamwriter.flush(boolean, boolean) [/_/src/system.private.corelib/shared/system/io/streamwriter.cs @ 261]
00007f6715561710 00007f669f6729f3 system.io.streamwriter.writeline(system.string) [/_/src/system.private.corelib/shared/system/io/streamwriter.cs @ 474]
00007f6715561760 00007f669f6727d3 system.io.textwriter+synctextwriter.writeline(system.string) [/_/src/system.private.corelib/shared/system/io/textwriter.cs @ 891]
00007f67155617a0 00007f669f672770 system.console.writeline(system.string) [/_/src/system.console/src/system/console.cs @ 550]
00007f67155617c0 00007f669f663791 netcoredumptest.program.printnumber(system.string, int32) [/home/zhouke/netcoredumptest/program.cs @ 18]
00007f6715561800 00007f669f6636d9 netcoredumptest.program+<>c.<main>b__0_0()
00007f6715561820 00007f669f1872a1 system.threading.tasks.task.innerinvoke() [/_/src/system.private.corelib/shared/system/threading/tasks/task.cs @ 2466]
00007f6715561840 00007f669f18cbc2 system.threading.tasks.task+<>c.<.cctor>b__274_0(system.object) [/_/src/system.private.corelib/shared/system/threading/tasks/task.cs @ 2445]
00007f6715561850 00007f669f171af2 system.threading.executioncontext.runfromthreadpooldispatchloop(system.threading.thread, system.threading.executioncontext, system.threading.contextcallback, system.object) [/_/src/system.private.corelib/shared/system/threading/executioncontext.cs @ 289]
00007f6715561890 00007f669f187111 system.threading.tasks.task.executewiththreadlocal(system.threading.tasks.task byref, system.threading.thread) [/_/src/system.private.corelib/shared/system/threading/tasks/task.cs @ 2406]
00007f6715561910 00007f669f186f28 system.threading.tasks.task.executeentryunsafe(system.threading.thread) [/_/src/system.private.corelib/shared/system/threading/tasks/task.cs @ 2344]
00007f6715561930 00007f669f186ebb system.threading.tasks.task.executefromthreadpool(system.threading.thread)
00007f6715561940 00007f669f17b754 system.threading.threadpoolworkqueue.dispatch() [/_/src/system.private.corelib/shared/system/threading/threadpool.cs @ 663]
00007f67155619c0 00007f669f169a5b system.threading._threadpoolwaitcallback.performwaitcallback() [/_/src/system.private.corelib/src/system/threading/threadpool.coreclr.cs @ 29]
00007f6715561d50 00007f6718a1ccaf [debuggeru2mcatchhandlerframe: 00007f6715561d50]

 

哗啦啦一大片, 有点java调用堆栈的味道, 不过我们还是找到了我们的问题代码

netcoredumptest.program.printnumber(system.string, int32)

 

有时候我们想知道传入的什么参数导致cpu占用高, 可以给clrstack加上参数 -a

> clrstack -a
..............
00007f0dd6ffc7c0 00007f0d6eef3791 netcoredumptest.program.printnumber(system.string, int32) [/home/zhouke/netcoredumptest/program.cs @ 18]
    parameters:
        message (0x00007f0dd6ffc7e8) = 0x00007f0d4800b8b0
        startnumber (0x00007f0dd6ffc7e4) = 0x0000000000000005
    locals:
        0x00007f0dd6ffc7e0 = 0x000000000014e42b
        0x00007f0dd6ffc7dc = 0x0000000000000001
...............

可以看到parameters里, startnumber作为值类型, 可以直接看到数值为5, 而message是引用类型, 指向0x00007f0d4800b8b0, 这时候需要用到 dumpobj 命令

> dumpobj 0x00007f0d4800b8b0
name:        system.string
methodtable: 00007f0d6ef70f90
eeclass:     00007f0d6eede1c0
size:        32(0x20) bytes
file:        /home/zhouke/dotnet/shared/microsoft.netcore.app/3.0.0-preview6-27804-01/system.private.corelib.dll
string:      print
fields:
              mt    field   offset                 type vt     attr            value name
00007f0d6ef6a138  400022b        8         system.int32  1 instance                5 _stringlength
00007f0d6ef66f38  400022c        c          system.char  1 instance               50 _firstchar
00007f0d6ef70f90  400022d      108        system.string  0   static 00007f0d47fff360 empty

好了, 可以看到它是一个字符串, 内容为 “print”

假如message是一个复杂类型, 可以查看fields下面的信息进一步查看

clrstack 还有一个实验性质的参数 -i, 协助查看各种变量信息, 需要用到lldb, 按照官方教程, 我暂时没有实验成功.

 

查看进程id和线程id, 更方便的方法是 htop(需要安装), 然后按 f4 进行过滤, 输入dotnet 即可

这张图是重新运行问题程序的结果, 进程id和线程id与前面不一样

第二行白色的是进程id=1650, 第一行cpu占用高, 是问题线程id=1658

 

 

 

end