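1. Preface
Elasticsearch officially ships several kinds of installation packages, and the resulting installation differs considerably depending on which one you pick. Each package is tailored to a particular use case: the directory layout, the script files, and the bundled modules all differ slightly, so that the installed product fits the scenario the package was built for. There is nothing wrong with this design in itself.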
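2. Background
In special situations, or for people who like to tinker, it is tempting to do things one's own way, and that tends to run into problems that look strange at first.
For example, I wanted to install from the official TAR.GZ package, write my own systemd unit file based on the configuration files shipped in the RPM package, and start elasticsearch.service with systemctl. The install path can then be chosen freely, which makes the files easier to manage.
That idea is what led to this article.
Elasticsearch version: 7.9.1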
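2.1 Comparing the Elasticsearch TAR and RPM packages
Installing from the official TAR.GZ package places every Elasticsearch file under a single directory. The unpacked archive is the whole installation, and by default the process has to be started by hand.
The RPM package is usually installed with rpm -ivh ***.rpm. A look inside the official RPM shows that it installs configuration files, executables, log directories, system configuration files, and so on into different system directories, following the layout defined in the package.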
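Characteristics of a TAR.GZ installation:
- All files live in one directory
- References and dependencies between files are resolved through relative paths
- By default the program is started by running bin/elasticsearch; the -d option starts it as a daemon
- Nothing else is installed; the archive is usable as soon as it is unpacked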
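Characteristics of the RPM package:
- Files are copied into their respective system directories
- Some references between files use absolute paths (matching the RPM install locations); these can be edited by hand after installation
- The service can be started with systemctl start elasticsearch
- The package carries system-level configuration files, such as the kernel parameter file under sysctl.d/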
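For example, after a tar installation, files in the install directory refer to each other through relative paths. The contents of jvm.options show the difference clearly.
jvm.options as unpacked from the tar package: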
## JVM configuration
################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms1g
-Xmx1g
################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################
## GC configuration
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly
## G1GC Configuration
# NOTE: G1 GC is only supported on JDK version 10 or later
# to use G1GC, uncomment the next two lines and update the version on the
# following three lines to your version of the JDK
# 10-13:-XX:-UseConcMarkSweepGC
# 10-13:-XX:-UseCMSInitiatingOccupancyOnly
14-:-XX:+UseG1GC
14-:-XX:G1ReservePercent=25
14-:-XX:InitiatingHeapOccupancyPercent=30
## JVM temporary directory
-Djava.io.tmpdir=${ES_TMPDIR}
## heap dumps
# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError
# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=data
# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=logs/hs_err_pid%p.log
## JDK 8 GC logging
8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m
# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
jvm.options as installed by the RPM package at /etc/elasticsearch/jvm.options:
## JVM configuration
################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms1g
-Xmx1g
################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################
## GC configuration
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly
## G1GC Configuration
# NOTE: G1 GC is only supported on JDK version 10 or later
# to use G1GC, uncomment the next two lines and update the version on the
# following three lines to your version of the JDK
# 10-13:-XX:-UseConcMarkSweepGC
# 10-13:-XX:-UseCMSInitiatingOccupancyOnly
14-:-XX:+UseG1GC
14-:-XX:G1ReservePercent=25
14-:-XX:InitiatingHeapOccupancyPercent=30
## JVM temporary directory
-Djava.io.tmpdir=${ES_TMPDIR}
## heap dumps
# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError
# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=/var/lib/elasticsearch
# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
## JDK 8 GC logging
8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m
# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
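The heap dump, JVM error file, and GC log paths differ between the two files: the tar package uses relative paths, while the RPM package uses absolute ones (compare the last line of each file, for example).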
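To see only the path-related differences without reading both files in full, the relevant options can be filtered out. The following is a minimal sketch; the function name jvm_path_options and the tar install path in the usage comment are illustrative assumptions, not part of the original setup.

```shell
# Print the non-comment lines of a jvm.options file that carry a file path:
# the heap dump path, the JVM fatal error file, and the GC log location.
jvm_path_options() {
    grep -E '^[^#]*(HeapDumpPath=|ErrorFile=|gc\.log)' "$1"
}

# Usage (the first path is a hypothetical tar install location):
#   jvm_path_options /opt/elasticsearch-7.9.1/config/jvm.options
#   jvm_path_options /etc/elasticsearch/jvm.options
```

Running it against each file in turn lists the relative paths of the tar package next to the absolute paths of the RPM package.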
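2.2 Problem when starting a tar installation through systemd
After the environment variables and path mappings have been sorted out, starting the service still hangs. For example: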
systemctl start elasticsearch
Once executed, the command does not return.
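3. Solution
3.1 Manually add the systemd module to the modules directory
Step 1: Take the systemd module from the modules directory of the unpacked RPM package and place it in the modules directory of the tar installation.
Step 2: Fix the ownership and permissions of the copied files and directories (the systemd unit starts the service as the elasticsearch user by default).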
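The two steps above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name install_systemd_module, the unpacked RPM location, and the tar install path are hypothetical, and the default owner elasticsearch:elasticsearch is taken from the User= and Group= settings of the RPM's service file.

```shell
# Copy the systemd module from an unpacked RPM tree into a tar installation
# and hand ownership to the account that runs the service.
#   $1  root of the unpacked RPM tree (hypothetical, e.g. /tmp/es-rpm)
#   $2  ES_HOME of the tar installation (hypothetical, e.g. /opt/elasticsearch-7.9.1)
#   $3  owner as user:group (defaults to elasticsearch:elasticsearch)
install_systemd_module() {
    rpm_root=$1
    es_home=$2
    owner=${3:-elasticsearch:elasticsearch}
    src="$rpm_root/usr/share/elasticsearch/modules/systemd"

    if [ ! -d "$src" ]; then
        echo "systemd module not found under $src" >&2
        return 1
    fi

    # Step 1: copy the module directory into the tar installation.
    cp -R "$src" "$es_home/modules/" || return 1

    # Step 2: make the service account the owner of the copied files.
    chown -R "$owner" "$es_home/modules/systemd"
}

# Usage (both paths are assumptions):
#   install_systemd_module /tmp/es-rpm /opt/elasticsearch-7.9.1
```

Changing ownership to another user normally requires root, so in practice the function would be run with sufficient privileges.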
3.2 Change the value of ES_DISTRIBUTION_TYPE in bin/elasticsearch-env
Step 1: Open bin/elasticsearch-env and find ES_DISTRIBUTION_TYPE. In the tar package its value is tar; change it to rpm.
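The same edit can be scripted instead of done in an editor. The sketch below keeps a backup copy and rewrites only the assignment line; the function name set_distribution_type and the install path in the usage comment are assumptions made for illustration.

```shell
# Rewrite the ES_DISTRIBUTION_TYPE assignment in an elasticsearch-env file.
#   $1  path to bin/elasticsearch-env
#   $2  new value, for example rpm
# A backup is kept next to the file as <file>.bak.
set_distribution_type() {
    env_file=$1
    new_type=$2

    cp "$env_file" "$env_file.bak" || return 1

    # Only the line that starts with the assignment is changed; later lines
    # that merely reference "$ES_DISTRIBUTION_TYPE" are left untouched.
    # Writing back with a redirect keeps the original file's permissions.
    sed "s/^ES_DISTRIBUTION_TYPE=.*/ES_DISTRIBUTION_TYPE=$new_type/" \
        "$env_file.bak" > "$env_file"
}

# Usage (the path is an assumption):
#   set_distribution_type /opt/elasticsearch-7.9.1/bin/elasticsearch-env rpm
```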
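4. Root cause
4.1 How Elasticsearch starts under systemd
The systemd service file was taken from the unpacked official RPM package and then adapted. The unpacked RPM has the following directory structure: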
.
├── etc
│ ├── elasticsearch
│ │ ├── elasticsearch.yml
│ │ ├── jvm.options
│ │ ├── jvm.options.d
│ │ ├── log4j2.properties
│ │ ├── role_mapping.yml
│ │ ├── roles.yml
│ │ ├── users
│ │ └── users_roles
│ ├── init.d
│ │ └── elasticsearch
│ └── sysconfig
│ └── elasticsearch
├── usr
│ ├── lib
│ │ ├── sysctl.d
│ │ │ └── elasticsearch.conf
│ │ ├── systemd
│ │ │ └── system
│ │ └── tmpfiles.d
│ │ └── elasticsearch.conf
│ └── share
│ └── elasticsearch
│ ├── bin
│ ├── jdk
│ ├── lib
│ ├── LICENSE.txt
│ ├── modules
│ ├── NOTICE.txt
│ ├── plugins
│ └── README.asciidoc
└── var
├── lib
│ └── elasticsearch
└── log
└── elasticsearch
23 directories, 15 files
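The article does not say how the RPM was unpacked. One common way to unpack an RPM without installing it is to pipe rpm2cpio into cpio; the following is a minimal sketch under that assumption. The function name extract_rpm, the package file name, and the target directory in the usage comment are illustrative.

```shell
# Unpack an RPM into a directory without installing it.
#   $1  path to the .rpm file
#   $2  directory to unpack into (created if missing)
extract_rpm() {
    rpm_file=$1
    dest_dir=$2

    if [ ! -f "$rpm_file" ]; then
        echo "no such RPM file: $rpm_file" >&2
        return 1
    fi

    # Make the path absolute, because the extraction runs inside dest_dir.
    case $rpm_file in
        /*) ;;
        *) rpm_file="$PWD/$rpm_file" ;;
    esac

    mkdir -p "$dest_dir" || return 1

    # cpio: -i extract, -d create directories as needed, -m keep file mtimes.
    ( cd "$dest_dir" && rpm2cpio "$rpm_file" | cpio -idm )
}

# Usage (file name and target directory are assumptions):
#   extract_rpm elasticsearch-7.9.1-x86_64.rpm /tmp/es-rpm
```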
Print the elasticsearch.service file from the unpacked tree:
cat usr/lib/systemd/system/elasticsearch.service
Its contents:
[Unit]
Description=Elasticsearch
Documentation=https://www.elastic.co
Wants=network-online.target
After=network-online.target
[Service]
Type=notify
RuntimeDirectory=elasticsearch
PrivateTmp=true
Environment=ES_HOME=/usr/share/elasticsearch
Environment=ES_PATH_CONF=/etc/elasticsearch
Environment=PID_DIR=/var/run/elasticsearch
Environment=ES_SD_NOTIFY=true
EnvironmentFile=-/etc/sysconfig/elasticsearch
WorkingDirectory=/usr/share/elasticsearch
User=elasticsearch
Group=elasticsearch
ExecStart=/usr/share/elasticsearch/bin/systemd-entrypoint -p ${PID_DIR}/elasticsearch.pid --quiet
# StandardOutput is configured to redirect to journalctl since
# some error messages may be logged in standard output before
# elasticsearch logging system is initialized. Elasticsearch
# stores its logs in /var/log/elasticsearch and does not use
# journalctl by default. If you also want to enable journalctl
# logging, you can simply remove the "quiet" option from ExecStart.
StandardOutput=journal
StandardError=inherit
# Specifies the maximum file descriptor number that can be opened by this process
LimitNOFILE=65535
# Specifies the maximum number of processes
LimitNPROC=4096
# Specifies the maximum size of virtual memory
LimitAS=infinity
# Specifies the maximum file size
LimitFSIZE=infinity
# Disable timeout logic and wait until process is stopped
TimeoutStopSec=0
# SIGTERM signal is used to stop the Java process
KillSignal=SIGTERM
# Send the signal only to the JVM rather than its control group
KillMode=process
# Java process is never killed
SendSIGKILL=no
# When a JVM receives a SIGTERM signal it exits with code 143
SuccessExitStatus=143
[Install]
WantedBy=multi-user.target
# Built for packages-7.9.1 (packages)
Note these two lines in the service file:
Type=notify
and
Environment=ES_SD_NOTIFY=true
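Reading this together with the source code shows that Elasticsearch relies on a notification mechanism here. After systemd launches the service through ExecStart, it waits for a readiness notification sent back by the Elasticsearch process. Only when that notification arrives does systemd treat the start as successful; otherwise it keeps waiting until a start timeout fires. The service file above sets no start timeout of its own (only TimeoutStopSec), so if the Elasticsearch process never notifies systemd, systemctl start simply sits there, bounded at most by whatever default start timeout systemd itself applies.
The service file only works in combination with the Elasticsearch module that implements the notification. Without an RPM installation, the modules directory has no systemd module, so the start is bound to hang, and it is puzzling why the command never returns.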
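Whether a given unit file makes systemd wait for such a notification can be checked directly from its text. The sketch below is an illustration only; the function names unit_expects_notify and unit_start_timeout are made up for this example.

```shell
# Succeed if the unit file declares Type=notify, meaning systemd waits for a
# readiness notification from the started process.
unit_expects_notify() {
    grep -Eq '^[[:space:]]*Type[[:space:]]*=[[:space:]]*notify[[:space:]]*$' "$1"
}

# Print the TimeoutStartSec value from the unit file, or a note if it is absent.
unit_start_timeout() {
    value=$(sed -n 's/^[[:space:]]*TimeoutStartSec[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1)
    printf '%s\n' "${value:-not set in unit file}"
}

# Usage:
#   unit_expects_notify usr/lib/systemd/system/elasticsearch.service && echo "waits for notify"
#   unit_start_timeout usr/lib/systemd/system/elasticsearch.service
```

On a running system, systemctl show elasticsearch -p Type -p TimeoutStartUSec reports the values systemd actually applies, including defaults that do not appear in the unit file.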
Compare the contents of the modules directory after a tar installation and after an RPM installation:
tar installation:
[root@elk modules]# ls
aggs-matrix-stats lang-mustache spatial x-pack-ccr x-pack-ml
analysis-common lang-painless tasks x-pack-core x-pack-monitoring
constant-keyword mapper-extras transform x-pack-data-streams x-pack-ql
flattened parent-join transport-netty4 x-pack-deprecation x-pack-rollup
frozen-indices percolator vectors x-pack-enrich x-pack-security
ingest-common rank-eval wildcard x-pack-eql x-pack-sql
ingest-geoip reindex x-pack-analytics x-pack-graph x-pack-stack
ingest-user-agent repository-url x-pack-async x-pack-identity-provider x-pack-voting-only-node
kibana searchable-snapshots x-pack-async-search x-pack-ilm x-pack-watcher
lang-expression search-business-rules x-pack-autoscaling x-pack-logstash
[root@elk modules]# ls | wc
49 49 695
RPM installation:
[root@elk modules]# ls
aggs-matrix-stats lang-mustache spatial x-pack-autoscaling x-pack-logstash
analysis-common lang-painless systemd x-pack-ccr x-pack-ml
constant-keyword mapper-extras tasks x-pack-core x-pack-monitoring
flattened parent-join transform x-pack-data-streams x-pack-ql
frozen-indices percolator transport-netty4 x-pack-deprecation x-pack-rollup
ingest-common rank-eval vectors x-pack-enrich x-pack-security
ingest-geoip reindex wildcard x-pack-eql x-pack-sql
ingest-user-agent repository-url x-pack-analytics x-pack-graph x-pack-stack
kibana searchable-snapshots x-pack-async x-pack-identity-provider x-pack-voting-only-node
lang-expression search-business-rules x-pack-async-search x-pack-ilm x-pack-watcher
[root@elk modules]# ls | wc
50 50 703
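The two listings can also be compared mechanically rather than by eye. The following sketch prints every module directory that exists in one modules directory but not in the other; the function name missing_modules and the tar install path in the usage comment are assumptions.

```shell
# Print the names of module directories that exist in the first modules
# directory but are absent from the second.
#   $1  reference modules directory (for example the RPM one)
#   $2  modules directory to check (for example the tar one)
missing_modules() {
    src_dir=$1
    dst_dir=$2
    for module in "$src_dir"/*/; do
        # Skip the unexpanded pattern when src_dir has no subdirectories.
        [ -d "$module" ] || continue
        name=$(basename "$module")
        [ -e "$dst_dir/$name" ] || printf '%s\n' "$name"
    done
}

# Usage (the second path is a hypothetical tar install location):
#   missing_modules /usr/share/elasticsearch/modules /opt/elasticsearch-7.9.1/modules
```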
The difference is plainly a single module, systemd. Its job is simple, as the module's source code shows. The main part is:
public class SystemdPlugin extends Plugin implements ClusterPlugin {
    private static final Logger logger = LogManager.getLogger(SystemdPlugin.class);
    private final boolean enabled;
    final boolean isEnabled() {
        return enabled;
    }
    @SuppressWarnings("unused")
    public SystemdPlugin() {
        // Read the ES_SD_NOTIFY environment variable and the build type of the current distribution
        this(true, Build.CURRENT.type(), System.getenv("ES_SD_NOTIFY"));
    }
    SystemdPlugin(final boolean assertIsPackageDistribution, final Build.Type buildType, final String esSDNotify) {
        // sd_notify is only enabled when the build type is DEB or RPM; for any other type it is
        // disabled below (and the assert fails first if JVM assertions are enabled)
        final boolean isPackageDistribution = buildType == Build.Type.DEB || buildType == Build.Type.RPM;
        if (assertIsPackageDistribution) {
            // our build is configured to only include this module in the package distributions
            assert isPackageDistribution : buildType;
        }
        if (isPackageDistribution == false) {
            logger.debug("disabling sd_notify as the build type [{}] is not a package distribution", buildType);
            enabled = false;
            return;
        }
        logger.trace("ES_SD_NOTIFY is set to [{}]", esSDNotify);
        if (esSDNotify == null) {
            enabled = false;
            return;
        }
        if (Boolean.TRUE.toString().equals(esSDNotify) == false && Boolean.FALSE.toString().equals(esSDNotify) == false) {
            throw new RuntimeException("ES_SD_NOTIFY set to unexpected value [" + esSDNotify + "]");
        }
        enabled = Boolean.TRUE.toString().equals(esSDNotify);
    }
    private final SetOnce<Scheduler.Cancellable> extender = new SetOnce<>();
    Scheduler.Cancellable extender() {
        return extender.get();
    }
    ...
}
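The code reveals another key point: Build.CURRENT.type(). This call returns the build type of the current distribution, and that type decides whether the plugin sends notifications at all. For anything other than DEB or RPM, sd_notify is disabled, or the assertion fails if JVM assertions are enabled. (Perhaps a little too strict... 🤣)
Following the source further leads to elasticsearch/server/src/main/java/org/elasticsearch/Build.java. The key parts are: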
...
public enum Type {
    DEB("deb"),
    DOCKER("docker"),
    RPM("rpm"),
    TAR("tar"),
    ZIP("zip"),
    UNKNOWN("unknown");
    final String displayName;
    public String displayName() {
        return displayName;
    }
    Type(final String displayName) {
        this.displayName = displayName;
    }
    public static Type fromDisplayName(final String displayName, final boolean strict) {
        switch (displayName) {
            case "deb":
                return Type.DEB;
            case "docker":
                return Type.DOCKER;
            case "rpm":
                return Type.RPM;
            case "tar":
                return Type.TAR;
            case "zip":
                return Type.ZIP;
            case "unknown":
                return Type.UNKNOWN;
            default:
                if (strict) {
                    throw new IllegalStateException("unexpected distribution type [" + displayName + "]; your distribution is broken");
                } else {
                    return Type.UNKNOWN;
                }
        }
    }
}
static {
    final Flavor flavor;
    final Type type;
    final String hash;
    final String date;
    final boolean isSnapshot;
    final String version;
    // these are parsed at startup, and we require that we are able to recognize the values passed in by the startup scripts
    flavor = Flavor.fromDisplayName(System.getProperty("es.distribution.flavor", "unknown"), true);
    // The distribution type is read from the system property es.distribution.type
    type = Type.fromDisplayName(System.getProperty("es.distribution.type", "unknown"), true);
    ...
    CURRENT = new Build(flavor, type, hash, date, isSnapshot, version);
}
...
public Type type() {
    return type;
}
...
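As the code shows, Build.CURRENT.type() returns a concrete Type value, which is derived from es.distribution.type. That name is read with System.getProperty, so it is a JVM system property and is presumably passed to the program as a JVM option.
The next section tracks down where es.distribution.type is set.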
4.2 What is hidden in elasticsearch-env
Search all files in the RPM installation directory:
[root@elk elasticsearch]# grep -Rn "es.distribution.type" *
bin/elasticsearch:66: -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
bin/elasticsearch:79: -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
bin/elasticsearch-cli:30: -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
Line 66 of bin/elasticsearch, in context:
...
58 if [[ $DAEMONIZE = false ]]; then
59 exec \
60 "$JAVA" \
61 "$XSHARE" \
62 $ES_JAVA_OPTS \
63 -Des.path.home="$ES_HOME" \
64 -Des.path.conf="$ES_PATH_CONF" \
65 -Des.distribution.flavor="$ES_DISTRIBUTION_FLAVOR" \
66 -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
67 -Des.bundled_jdk="$ES_BUNDLED_JDK" \
68 -cp "$ES_CLASSPATH" \
69 org.elasticsearch.bootstrap.Elasticsearch \
70 "$@" <<<"$KEYSTORE_PASSWORD"
...
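The script has a daemon branch and a non-daemon branch; reading it shows that the non-daemon branch, the code above, is the one that runs here. The value of es.distribution.type comes from ES_DISTRIBUTION_TYPE, so search for that name next: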
[root@elk elasticsearch]# grep -Rn ES_DISTRIBUTION_TYPE *
bin/elasticsearch:66: -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
bin/elasticsearch:79: -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
bin/elasticsearch-cli:30: -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
bin/elasticsearch-env:92:ES_DISTRIBUTION_TYPE=rpm
bin/elasticsearch-env:95:if [[ "$ES_DISTRIBUTION_TYPE" == "docker" ]]; then
Surprisingly, line 92 of bin/elasticsearch-env hard-codes ES_DISTRIBUTION_TYPE=rpm. Tracing the script logic further shows that bin/elasticsearch-env sets the variable first, and bin/elasticsearch then uses it.
Now the same search in the tar package:
[root@elk elasticsearch-7.9.1]# grep -Rn ES_DISTRIBUTION_TYPE *
bin/elasticsearch-cli:30: -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
bin/elasticsearch:66: -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
bin/elasticsearch:79: -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
bin/elasticsearch-env:92:ES_DISTRIBUTION_TYPE=tar
bin/elasticsearch-env:95:if [[ "$ES_DISTRIBUTION_TYPE" == "docker" ]]; then
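Here line 92 of bin/elasticsearch-env reads ES_DISTRIBUTION_TYPE=tar, and suddenly it all makes sense (😂): each package presets a different type in its scripts, so the systemd module takes a different branch.
In summary, to combine the scripts taken from the RPM package with the files unpacked from the tar package, two things are needed: add the systemd module to the modules directory, and change tar to rpm in bin/elasticsearch-env. Only then is the notify mechanism triggered, so that the service can be started with systemctl.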
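Both conditions can be verified before the first systemctl start. The check below is a sketch based only on what the source code above requires (a systemd module directory and a DEB or RPM build type); the function name check_systemd_ready and the install path in the usage comment are assumptions.

```shell
# Check that a tar installation meets both conditions for sd_notify:
# the systemd module is present, and the distribution type is rpm or deb.
#   $1  ES_HOME of the installation
# Prints one line per unmet condition; returns 0 only if both are met.
check_systemd_ready() {
    es_home=$1
    status=0

    if [ ! -d "$es_home/modules/systemd" ]; then
        echo "missing module directory: $es_home/modules/systemd"
        status=1
    fi

    if ! grep -Eq '^ES_DISTRIBUTION_TYPE=(rpm|deb)$' "$es_home/bin/elasticsearch-env" 2>/dev/null; then
        echo "ES_DISTRIBUTION_TYPE in $es_home/bin/elasticsearch-env is not rpm or deb"
        status=1
    fi

    return "$status"
}

# Usage (the path is an assumption):
#   check_systemd_ready /opt/elasticsearch-7.9.1 && systemctl start elasticsearch
```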
4.3 Related community discussions
Several threads in the Elasticsearch community mention this problem and are worth a look:
https://discuss.elastic.co/t/starting-elasticsearch-with-systemd-hangs/229510
https://github.com/elastic/elasticsearch/issues/55477
https://discuss.elastic.co/t/does-the-tar-gz-archive-include-the-systemd-module/259372
https://discuss.elastic.co/t/elasticsearch-can-not-be-started-by-service/257773
https://discuss.elastic.co/t/elasticsearch-no-longer-works-under-systemd-7-4-0-on-centos-7-7-1908/201846/3
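Those discussions also mention a few other minor problems that can make a systemd start fail, but in each case the Elasticsearch logs show the cause and it can be fixed accordingly. Typical ones are permission problems on the tmp directory and JNA compatibility problems on certain unusual architectures, all of which can be resolved through configuration. Some of the suggested fixes simply change the notify setting. I do not recommend that: Elasticsearch starts relatively slowly and may run for a while before exiting with an error, so keeping the official notification mechanism is the safer choice, and the explanation above covers both the cause and the fix.
I am new to Elasticsearch myself, so if anything in this article is wrong, please point it out for everyone's benefit. Many thanks!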
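This article described how to start Elasticsearch through systemd after installing it from the tar.gz package, and how to fix the hanging start: add the systemd module manually and change ES_DISTRIBUTION_TYPE in elasticsearch-env so that the notify mechanism works and Elasticsearch starts normally.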