# 3.4 Attach常见的坑

# 不同版本JDK在Attach成功后返回结果差异性

  • 现象

当使用JDK11去attach JDK8应用时,会抛异常com.sun.tools.attach.AgentLoadException: 0 , 但实际上已经attach成功了。异常结果如下:

Start arthas failed, exception stack trace: com.sun.tools.attach.AgentLoadException: 0
 at jdk.attach/sun.tools.attach.HotSpotVirtualMachine.loadAgentLibrary(HotSpotVirtualMachine.java:108)
 at jdk.attach/sun.tools.attach.HotSpotVirtualMachine.loadAgentLibrary(HotSpotVirtualMachine.java:119)
 at jdk.attach/sun.tools.attach.HotSpotVirtualMachine.loadAgent(HotSpotVirtualMachine.java:147)
1
2
3
4
  • 原因

在不同的JDK中HotSpotVirtualMachine#loadAgentLibrary方法的返回值不一样 ,在JDK8中返回0表示attach成功。

代码位置:src/share/classes/sun/tools/attach/HotSpotVirtualMachine.java

private void loadAgentLibrary(String agentLibrary, boolean isAbsolute, String options)
    throws AgentLoadException, AgentInitializationException, IOException
{
    InputStream in = execute("load",
                             agentLibrary,
                             isAbsolute ? "true" : "false",
                             options);
    try {
        // 返回0表示attach成功
        int result = readInt(in);
        if (result != 0) {
            throw new AgentInitializationException("Agent_OnAttach failed", result);
        }
    } finally {
        in.close();

    }
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

JDK11返回的是"return code: 0"表示attach成功。

// 代码位置:src/jdk.attach/share/classes/sun/tools/attach/HotSpotVirtualMachine.java
private void loadAgentLibrary(String agentLibrary, boolean isAbsolute, String options) 
    throws AgentLoadException, AgentInitializationException, IOException 
{   
    // 返回结果
    String msgPrefix = "return code: "; 
    InputStream in = execute("load", 
                             agentLibrary, 
                             isAbsolute ? "true" : "false", 
                             options); 
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) { 
        String result = reader.readLine(); 
        if (result == null) { 
            throw new AgentLoadException("Target VM did not respond"); 
        } else if (result.startsWith(msgPrefix)) { 
            int retCode = Integer.parseInt(result.substring(msgPrefix.length())); 
            // "return code: 0" 表示attach成功 
            if (retCode != 0) { 
                throw new AgentInitializationException("Agent_OnAttach failed", retCode); 
            } 
        } else { 
            throw new AgentLoadException(result); 
        } 
    } 
} 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
  • 方案

发起Attach的进程需要兼容不同版本JDK返回结果。下面是arthas诊断工具对这个问题的兼容性处理方案:

代码位置:arthas/core/src/main/java/com/taobao/arthas/core/Arthas.java

try {
    virtualMachine.loadAgent(arthasAgentPath,
            configure.getArthasCore() + ";" + configure.toString());
} catch (IOException e) {
    // 处理返回值为 "return code: 0"
    if (e.getMessage() != null && e.getMessage().contains("Non-numeric value found")) {
        AnsiLog.warn(e);
        AnsiLog.warn("It seems to use the lower version of JDK to attach the higher version of JDK.");
        AnsiLog.warn(
                "This error message can be ignored, the attach may have been successful, and it will still try to connect.");
    } else {
        throw e;
    }
} catch (com.sun.tools.attach.AgentLoadException ex) {
    // 处理返回值为 "0"   
    if ("0".equals(ex.getMessage())) {
        // https://stackoverflow.com/a/54454418
        AnsiLog.warn(ex);
        AnsiLog.warn("It seems to use the higher version of JDK to attach the lower version of JDK.");
        AnsiLog.warn(   
                "This error message can be ignored, the attach may have been successful, and it will still try to connect.");
    } else {
        throw ex;
    }
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

上面的代码可以看出,在Attach抛出异常后,对异常进行分类处理,当抛出IOException并且异常的message中有"Non-numeric value found",表示该异常是由于低版本Attach API attach 到高版本JDK上; 当抛出的异常是AgentLoadException并且message的值为"0"时,表示该异常是由于高版本Attach API attach 到低版本JDK导致。

# .java_pid文件被删除

  • 现象

当执行attach命令如jstack时,出现报错Unable to open socket file: target process not responding or HotSpot VM not loaded。错误如下所示:

MacBook-Pro admin$ jstack 33000
33000: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding
1
2
3

并且/tmp目录下没有attach通讯的.java_pid文件。

MacBook-Pro admin$ ls .java_pid3000
ls: .java_pid3000: No such file or directory
1
2

然而,重启Java进程之后又可以使用jstack等attach工具了

  • 原因

很不幸,这是一个JDK的bug,原因是JVM在首次被attach时会创建.java_pid用于socket通信, 文件/tmp目录下(不同操作系统tmp目录位置不同,Linux 系统为/tmp 目录),该目录不可以被参数修改。 在Attach listener初始化过程中,这个文件首次被创建后,JVM会标记Attach Listener为initialized状态, 如果文件被删除了,这个Java进程无法被Attach。

  • 方案

    对于JDK8来说,只能重启进程;社区的讨论以及官方修复;

官方修复的pr给Attach Listener增加了INITIALIZING、NOT_INITIALIZED、INITIALIZED多种状态,并且在INITIALIZED状态下通过AttachListener::check_socket_file进行自检,如果发现文件不存在,会清理之前的listener,并重新建立。

修复代码如下,在代码行号为17处,对.attach_pid文件进行检测。

// Attempt to transit state to AL_INITIALIZING.
AttachListenerState cur_state = AttachListener::transit_state(AL_INITIALIZING, AL_NOT_INITIALIZED);
if (cur_state == AL_INITIALIZING) {
 // Attach Listener has been started to initialize. Ignore this signal.
  continue;
} else if (cur_state == AL_NOT_INITIALIZED) {
  // Start to initialize.
  if (AttachListener::is_init_trigger()) {
     // Attach Listener has been initialized.
     // Accept subsequent request.
      continue;
  } else {
     // Attach Listener could not be started.
     // So we need to transit the state to AL_NOT_INITIALIZED.
     AttachListener::set_state(AL_NOT_INITIALIZED);
  }
} else if (AttachListener::check_socket_file()) {
  // .attach_pid文件进行检测
  // Attach Listener has been started, but unix domain socket file
  // does not exist. So restart Attach Listener.
  continue;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22

需要说明的是,该修复仅限JDK11以上版本。

# attach进程的权限问题

  • 现象

如果在root用户下执行jstack,而目标JVM进程不是root权限启动,执行报错如下:

Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding
1
2
  • 原因

下面是在JDK8上LinuxAttachListener线程接受命令的过程。在代码26行处会严格校验发起attach进程的uid和gid是否与目标JVM 一致。

代码位置:jdk8/src/hotspot/os/linux/vm/attachListener_linux.cpp


LinuxAttachOperation* LinuxAttachListener::dequeue() {
  for (;;) {
    int s;

    // wait for client to connect
    struct sockaddr addr;
    socklen_t len = sizeof(addr);
    RESTARTABLE(::accept(listener(), &addr, &len), s);
    if (s == -1) {
      return NULL;      // log a warning?
    }

    // get the credentials of the peer and check the effective uid/guid
    // - check with jeff on this.
    struct ucred cred_info;
    socklen_t optlen = sizeof(cred_info);
    if (::getsockopt(s, SOL_SOCKET, SO_PEERCRED, (void*)&cred_info, &optlen) == -1) {
      ::close(s);
      continue;
    }
    uid_t euid = geteuid();
    gid_t egid = getegid();
  
    // 严格校验 uid、gid
    if (cred_info.uid != euid || cred_info.gid != egid) {
      ::close(s);
      continue;
    }

    // peer credential look okay so we read the request
    LinuxAttachOperation* op = read_request(s);
    if (op == NULL) {
      ::close(s);
      continue;
    } else {
      return op;
    }
  }
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40

原则是上root权限不应该受到限制,因此JDK11对这个"不太合理"的限制做了解除,可以使用root权限attach任意用户启动的Java进程。

代码位置:jdk11/src/hotspot/os/linux/attachListener_linux.cpp

LinuxAttachOperation* LinuxAttachListener::dequeue() {
  for (;;) {
    int s;

    // wait for client to connect
    struct sockaddr addr;
    socklen_t len = sizeof(addr);
    RESTARTABLE(::accept(listener(), &addr, &len), s);
    if (s == -1) {
      return NULL;      // log a warning?
    }

    // get the credentials of the peer and check the effective uid/guid
    struct ucred cred_info;
    socklen_t optlen = sizeof(cred_info);
    if (::getsockopt(s, SOL_SOCKET, SO_PEERCRED, (void*)&cred_info, &optlen) == -1) {
      log_debug(attach)("Failed to get socket option SO_PEERCRED");
      ::close(s);
      continue;
    }
    // 允许root权限attach
    if (!os::Posix::matches_effective_uid_and_gid_or_root(cred_info.uid, cred_info.gid)) {
      log_debug(attach)("euid/egid check failed (%d/%d vs %d/%d)",
              cred_info.uid, cred_info.gid, geteuid(), getegid());
      ::close(s);
      continue;
    }

    // peer credential look okay so we read the request
    LinuxAttachOperation* op = read_request(s);
    if (op == NULL) {
      ::close(s);
      continue;
    } else {
      return op;
    }
  }
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38

matches_effective_uid_and_gid_or_root 的实现如下:

代码位置:jdk11/src/hotspot/os/linux/attachListener_linux.cpp

bool os::Posix::matches_effective_uid_and_gid_or_root(uid_t uid, gid_t gid) {
    return is_root(uid) || (geteuid() == uid && getegid() == gid);
}
1
2
3
  • 解决方案

切换到与用户相同权限执行然后再执行Attach。 在介绍jattach工具时已经对这部分代码做了详细分析,这里不在赘述。

# com.sun.tools.attach.AttachNotSupportedException: no providers installed

  • 原因以及解决方案 是因为引用的tools.jar包有问题,应该这样引用tools.jar
<dependency>
	<groupId>com.sun</groupId>
	<artifactId>tools</artifactId>
	<version>1.5.0</version>
	<scope>system</scope>
	<systemPath>/path/to/your/jdk/lib/tools.jar</systemPath>
</dependency>
1
2
3
4
5
6
7

systemPath标签用来指定本地的tools.jar位置,可以把tools.jar的绝对路径配置成相对路径:

<dependency>
	<groupId>com.sun</groupId>
	<artifactId>tools</artifactId>
	<version>1.5.0</version>
	<scope>system</scope>
	<systemPath>${env.JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>
1
2
3
4
5
6
7