Installing Elasticsearch from the tarball (tar.gz) and running it under systemd with a custom elasticsearch.service: what needs to change

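This article explains how to start Elasticsearch through systemd after installing it from the tar.gz package, and how to fix the resulting startup hang. Two changes make the notify mechanism work so that Elasticsearch starts normally:
- manually add the systemd module;
- change the ES_DISTRIBUTION_TYPE value in elasticsearch-env.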

1. Preface

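Elastic provides several official installation packages, and which one you choose makes a big difference. Each package has been tailored by the maintainers for a particular use case, so the directory layout, scripts and bundled modules all differ slightly between them. As a result, each installation fits its intended scenario. There is nothing wrong with this design in itself.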

2. Background

In some special scenarios, or if you simply like to tinker, you may set things up your own way and run into problems that look rather strange.

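For example, I wanted to install from the official TAR.GZ package and write my own systemd unit, modeled on the configuration files in the RPM package, so that elasticsearch.service could be started with systemctl. That way the installation path can be chosen freely, which makes the files easier to manage.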

That idea is what led to this article.

Elasticsearch version: 7.9.1

2.1 Comparing the Elasticsearch TAR and RPM packages

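Installing from the official TAR.GZ package puts every Elasticsearch file under a single directory and nothing anywhere else: what you extract is everything. By default the program has to be started manually.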

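The RPM package is usually installed with rpm -ivh ***.rpm. A brief look at the official rpm shows that it spreads its contents across system directories according to the package's layout: configuration files, executables, log directories, system configuration files and so on. On Linux, each of these ends up in its corresponding standard directory.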

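  • Characteristics of a TAR.GZ installation:
  1. All files live in one directory
  2. References and dependencies between files are resolved through relative paths
  3. By default the program is started by running bin/elasticsearch; adding -d starts it in daemon mode
  4. Nothing else is installed; extract it and it is ready to use
  • Characteristics of the RPM package:
  1. Files are copied into their respective system directories
  2. Some references between files use absolute paths (matching the rpm install locations); these can be edited by hand after installation
  3. The service can be started with systemctl start elasticsearch
  4. The rpm ships system-level configuration files, for example a kernel parameter file under sysctl.d/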

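For example, after a tar install, files and variables inside the install directory refer to each other through relative paths. Comparing the jvm.options file from each package shows the difference clearly.
The jvm.options file extracted from the tar package: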

## JVM configuration

################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms1g
-Xmx1g

################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################

## GC configuration
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1 GC is only supported on JDK version 10 or later
# to use G1GC, uncomment the next two lines and update the version on the
# following three lines to your version of the JDK
# 10-13:-XX:-UseConcMarkSweepGC
# 10-13:-XX:-UseCMSInitiatingOccupancyOnly
14-:-XX:+UseG1GC
14-:-XX:G1ReservePercent=25
14-:-XX:InitiatingHeapOccupancyPercent=30

## JVM temporary directory
-Djava.io.tmpdir=${ES_TMPDIR}

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError

# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=data

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=logs/hs_err_pid%p.log

## JDK 8 GC logging
8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m

With the rpm install, /etc/elasticsearch/jvm.options contains:


## JVM configuration

################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms1g
-Xmx1g

################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################

## GC configuration
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1 GC is only supported on JDK version 10 or later
# to use G1GC, uncomment the next two lines and update the version on the
# following three lines to your version of the JDK
# 10-13:-XX:-UseConcMarkSweepGC
# 10-13:-XX:-UseCMSInitiatingOccupancyOnly
14-:-XX:+UseG1GC
14-:-XX:G1ReservePercent=25
14-:-XX:InitiatingHeapOccupancyPercent=30

## JVM temporary directory
-Djava.io.tmpdir=${ES_TMPDIR}

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError

# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=/var/lib/elasticsearch

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log

## JDK 8 GC logging
8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m

Several of the log and dump paths differ. The tar version uses relative paths, while the rpm version uses absolute paths (see, for example, the last line).
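To compare the path-related settings of the two files without reading them in full, a small grep helper can pull out just those lines. This is a minimal sketch. The function name show_jvm_paths and the /opt/elasticsearch-7.9.1 install path are illustrative assumptions, and the pattern only covers the heap dump, error file and GC log settings shown above.

```shell
# Print the jvm.options lines that name a filesystem location
# (heap dumps, fatal error logs, GC logs), with their line numbers.
show_jvm_paths() {
  grep -nE 'HeapDumpPath=|ErrorFile=|Xloggc:|file=' "$1"
}

# Example usage (paths are assumptions):
#   show_jvm_paths /opt/elasticsearch-7.9.1/config/jvm.options   # tar: data, logs/...
#   show_jvm_paths /etc/elasticsearch/jvm.options                # rpm: /var/lib/..., /var/log/...
```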

2.2 The problem: starting a tar install through systemd

Even after fixing the various environment variable and path mapping issues, startup hangs. For example:

systemctl start elasticsearch

The command never returns.

3. Solution

3.1 Manually add the systemd module to the modules directory

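Step 1: copy the systemd module from the modules directory of the extracted rpm package into the modules directory of the tar install.
Step 2: fix the ownership and permissions of the files and directories (the systemd unit runs as the elasticsearch user by default).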
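The two steps above might be scripted roughly as follows. This is a sketch under stated assumptions:
- the helper name install_systemd_module is hypothetical;
- the rpm has been unpacked with rpm2cpio and cpio (the rpm file name in the comment is only an example);
- chowning the whole tar install to elasticsearch:elasticsearch is one reasonable reading of step 2, not necessarily the exact permissions the author used.

```shell
# Hypothetical helper: copy the systemd module from an unpacked rpm into a
# tar install, then give the install tree to the user the unit runs as.
# The rpm can be unpacked, for example, with:
#   mkdir rpm-root && cd rpm-root && rpm2cpio elasticsearch-7.9.1-x86_64.rpm | cpio -idm
install_systemd_module() {
  local rpm_root=$1                          # directory the rpm was unpacked into
  local es_home=$2                           # tar install dir, e.g. /opt/elasticsearch-7.9.1 (assumption)
  local owner=${3:-elasticsearch:elasticsearch}
  local src="$rpm_root/usr/share/elasticsearch/modules/systemd"

  if [ ! -d "$src" ]; then
    echo "systemd module not found: $src" >&2
    return 1
  fi
  # Step 1: place the module next to the other bundled modules
  cp -r "$src" "$es_home/modules/" || return 1
  # Step 2: the unit uses User=/Group=elasticsearch, so that user must own the tree
  chown -R "$owner" "$es_home"
}
```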

3.2 Change the ES_DISTRIBUTION_TYPE value in bin/elasticsearch-env

Step 1: open bin/elasticsearch-env and find ES_DISTRIBUTION_TYPE. In the tar package its value is tar; change it to rpm.
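The edit can be made with sed. This is a minimal sketch that assumes a GNU or BSD sed accepting -i.bak; set_distribution_type is a hypothetical helper name. The systemd plugin accepts both the DEB and RPM build types (see its source in section 4.1), so deb would presumably work too, but rpm matches the rest of this article.

```shell
# Rewrite the distribution type preset in bin/elasticsearch-env (tar -> rpm by
# default), keeping a .bak copy, then print the resulting line for confirmation.
set_distribution_type() {
  local es_home=$1 new_type=${2:-rpm}
  local env_file="$es_home/bin/elasticsearch-env"
  sed -i.bak "s/^ES_DISTRIBUTION_TYPE=.*/ES_DISTRIBUTION_TYPE=$new_type/" "$env_file" || return 1
  grep -n '^ES_DISTRIBUTION_TYPE=' "$env_file"
}

# Example (install path is an assumption):
#   set_distribution_type /opt/elasticsearch-7.9.1
```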

4. Root cause

4.1 How Elasticsearch starts under systemd

The service file used with systemd was taken from the extracted official rpm package and then adapted. The extracted rpm has the following layout:

.
├── etc
│   ├── elasticsearch
│   │   ├── elasticsearch.yml
│   │   ├── jvm.options
│   │   ├── jvm.options.d
│   │   ├── log4j2.properties
│   │   ├── role_mapping.yml
│   │   ├── roles.yml
│   │   ├── users
│   │   └── users_roles
│   ├── init.d
│   │   └── elasticsearch
│   └── sysconfig
│       └── elasticsearch
├── usr
│   ├── lib
│   │   ├── sysctl.d
│   │   │   └── elasticsearch.conf
│   │   ├── systemd
│   │   │   └── system
│   │   └── tmpfiles.d
│   │       └── elasticsearch.conf
│   └── share
│       └── elasticsearch
│           ├── bin
│           ├── jdk
│           ├── lib
│           ├── LICENSE.txt
│           ├── modules
│           ├── NOTICE.txt
│           ├── plugins
│           └── README.asciidoc
└── var
    ├── lib
    │   └── elasticsearch
    └── log
        └── elasticsearch

23 directories, 15 files

Print the elasticsearch.service file:

cat usr/lib/systemd/system/elasticsearch.service

Its contents:

[Unit]
Description=Elasticsearch
Documentation=https://www.elastic.co
Wants=network-online.target
After=network-online.target

[Service]
Type=notify
RuntimeDirectory=elasticsearch
PrivateTmp=true
Environment=ES_HOME=/usr/share/elasticsearch
Environment=ES_PATH_CONF=/etc/elasticsearch
Environment=PID_DIR=/var/run/elasticsearch
Environment=ES_SD_NOTIFY=true
EnvironmentFile=-/etc/sysconfig/elasticsearch

WorkingDirectory=/usr/share/elasticsearch

User=elasticsearch
Group=elasticsearch

ExecStart=/usr/share/elasticsearch/bin/systemd-entrypoint -p ${PID_DIR}/elasticsearch.pid --quiet

# StandardOutput is configured to redirect to journalctl since
# some error messages may be logged in standard output before
# elasticsearch logging system is initialized. Elasticsearch
# stores its logs in /var/log/elasticsearch and does not use
# journalctl by default. If you also want to enable journalctl
# logging, you can simply remove the "quiet" option from ExecStart.
StandardOutput=journal
StandardError=inherit

# Specifies the maximum file descriptor number that can be opened by this process
LimitNOFILE=65535

# Specifies the maximum number of processes
LimitNPROC=4096

# Specifies the maximum size of virtual memory
LimitAS=infinity

# Specifies the maximum file size
LimitFSIZE=infinity

# Disable timeout logic and wait until process is stopped
TimeoutStopSec=0

# SIGTERM signal is used to stop the Java process
KillSignal=SIGTERM

# Send the signal only to the JVM rather than its control group
KillMode=process

# Java process is never killed
SendSIGKILL=no

# When a JVM receives a SIGTERM signal it exits with code 143
SuccessExitStatus=143

[Install]
WantedBy=multi-user.target

# Built for packages-7.9.1 (packages)

Note these two lines:

Type=notify

and

Environment=ES_SD_NOTIFY=true

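The source code shows that Elasticsearch uses a notification mechanism here. Roughly, it works like this:
- systemd launches the service through ExecStart;
- it then waits for the Elasticsearch process to send back a notification;
- only once that notification arrives does it consider the service started.

The unit file sets no start timeout of its own, so systemd falls back to its default (DefaultTimeoutStartSec, typically 90 seconds). If the Elasticsearch process never notifies systemd, systemctl start blocks with no result until that timeout expires, which looks like a hang.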

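This unit file only works in combination with the Elasticsearch module that implements notify, and that pairing is where things get subtle. Without an rpm install, the modules directory has no systemd module, so startup is guaranteed to hang, and it is puzzling why the command never returns.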
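For reference, one possible way to adapt the rpm unit above to a tar install is sketched below. It rests on these assumptions:
- The install lives in a directory of your choosing; the example uses /opt/elasticsearch-7.9.1.
- Configuration stays in the tar's own config/ directory.
- ExecStart calls bin/elasticsearch directly instead of the rpm's bin/systemd-entrypoint, in case your tar install does not ship that script. If you copied systemd-entrypoint over, you can keep the original ExecStart line.

The notify-related lines (Type=notify, ES_SD_NOTIFY=true) are unchanged from the rpm. The module and ES_DISTRIBUTION_TYPE fixes from section 3 are therefore still required. write_es_unit is a hypothetical helper name.

```shell
# Hypothetical generator for a tar-install unit modeled on the rpm's elasticsearch.service.
write_es_unit() {
  local es_home=$1 unit_file=${2:-/etc/systemd/system/elasticsearch.service}
  cat > "$unit_file" <<EOF
[Unit]
Description=Elasticsearch
Documentation=https://www.elastic.co
Wants=network-online.target
After=network-online.target

[Service]
# Elasticsearch sends the readiness notification itself (systemd module)
Type=notify
RuntimeDirectory=elasticsearch
PrivateTmp=true
Environment=ES_HOME=$es_home
Environment=ES_PATH_CONF=$es_home/config
Environment=PID_DIR=/var/run/elasticsearch
Environment=ES_SD_NOTIFY=true
# Relative paths in jvm.options (data, logs/gc.log) should resolve against this
WorkingDirectory=$es_home
User=elasticsearch
Group=elasticsearch
# No -d: bin/elasticsearch execs the JVM, which becomes the service's main process
ExecStart=$es_home/bin/elasticsearch -p \${PID_DIR}/elasticsearch.pid --quiet
StandardOutput=journal
StandardError=inherit
LimitNOFILE=65535
LimitNPROC=4096
LimitAS=infinity
LimitFSIZE=infinity
TimeoutStopSec=0
KillSignal=SIGTERM
KillMode=process
SendSIGKILL=no
SuccessExitStatus=143

[Install]
WantedBy=multi-user.target
EOF
}
```

After writing the unit, run systemctl daemon-reload and then systemctl start elasticsearch (or systemctl enable --now elasticsearch).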

Compare the contents of the modules directory after a tar install and after an rpm install.
tar install:

[root@elk modules]# ls
aggs-matrix-stats  lang-mustache          spatial              x-pack-ccr                x-pack-ml
analysis-common    lang-painless          tasks                x-pack-core               x-pack-monitoring
constant-keyword   mapper-extras          transform            x-pack-data-streams       x-pack-ql
flattened          parent-join            transport-netty4     x-pack-deprecation        x-pack-rollup
frozen-indices     percolator             vectors              x-pack-enrich             x-pack-security
ingest-common      rank-eval              wildcard             x-pack-eql                x-pack-sql
ingest-geoip       reindex                x-pack-analytics     x-pack-graph              x-pack-stack
ingest-user-agent  repository-url         x-pack-async         x-pack-identity-provider  x-pack-voting-only-node
kibana             searchable-snapshots   x-pack-async-search  x-pack-ilm                x-pack-watcher
lang-expression    search-business-rules  x-pack-autoscaling   x-pack-logstash
[root@elk modules]# ls | wc
     49      49     695

rpm install:

[root@elk modules]# ls
aggs-matrix-stats  lang-mustache          spatial              x-pack-autoscaling        x-pack-logstash
analysis-common    lang-painless          systemd              x-pack-ccr                x-pack-ml
constant-keyword   mapper-extras          tasks                x-pack-core               x-pack-monitoring
flattened          parent-join            transform            x-pack-data-streams       x-pack-ql
frozen-indices     percolator             transport-netty4     x-pack-deprecation        x-pack-rollup
ingest-common      rank-eval              vectors              x-pack-enrich             x-pack-security
ingest-geoip       reindex                wildcard             x-pack-eql                x-pack-sql
ingest-user-agent  repository-url         x-pack-analytics     x-pack-graph              x-pack-stack
kibana             searchable-snapshots   x-pack-async         x-pack-identity-provider  x-pack-voting-only-node
lang-expression    search-business-rules  x-pack-async-search  x-pack-ilm                x-pack-watcher
[root@elk modules]# ls | wc
     50      50     703
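Rather than comparing the two listings by eye, comm can print the modules that exist only in the rpm install. This is a minimal sketch; the helper name missing_modules and the paths in the usage comment are illustrative.

```shell
# Print module directories present in the rpm install but absent from the tar install.
missing_modules() {
  local tar_modules=$1 rpm_modules=$2 a b
  a=$(mktemp); b=$(mktemp)
  ls "$tar_modules" | sort > "$a"
  ls "$rpm_modules" | sort > "$b"
  comm -13 "$a" "$b"      # suppress lines unique to the tar list and common lines
  rm -f "$a" "$b"
}

# Example (paths are assumptions):
#   missing_modules /opt/elasticsearch-7.9.1/modules /usr/share/elasticsearch/modules
```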

The difference is obviously a single module: systemd. Its job is simple, as the module's source shows; the core of it is:

public class SystemdPlugin extends Plugin implements ClusterPlugin {

    private static final Logger logger = LogManager.getLogger(SystemdPlugin.class);

    private final boolean enabled;

    final boolean isEnabled() {
        return enabled;
    }

    @SuppressWarnings("unused")
    public SystemdPlugin() {
        // Read ES_SD_NOTIFY from the environment and the build type of the current distribution
        this(true, Build.CURRENT.type(), System.getenv("ES_SD_NOTIFY"));
    }

    SystemdPlugin(final boolean assertIsPackageDistribution, final Build.Type buildType, final String esSDNotify) {
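        // Only DEB and RPM builds qualify; for other types the assert fails if assertions are enabled, otherwise sd_notify is simply disabled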
        final boolean isPackageDistribution = buildType == Build.Type.DEB || buildType == Build.Type.RPM;
        if (assertIsPackageDistribution) {
            // our build is configured to only include this module in the package distributions
            assert isPackageDistribution : buildType;
        }
        if (isPackageDistribution == false) {
            logger.debug("disabling sd_notify as the build type [{}] is not a package distribution", buildType);
            enabled = false;
            return;
        }
        logger.trace("ES_SD_NOTIFY is set to [{}]", esSDNotify);
        if (esSDNotify == null) {
            enabled = false;
            return;
        }
        if (Boolean.TRUE.toString().equals(esSDNotify) == false && Boolean.FALSE.toString().equals(esSDNotify) == false) {
            throw new RuntimeException("ES_SD_NOTIFY set to unexpected value [" + esSDNotify + "]");
        }
        enabled = Boolean.TRUE.toString().equals(esSDNotify);
    }

    private final SetOnce<Scheduler.Cancellable> extender = new SetOnce<>();

    Scheduler.Cancellable extender() {
        return extender.get();
    }
...
}

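The code reveals another key point: Build.CURRENT.type() returns the build type of the current package, and that type decides whether notify is enabled at all. For a non-package type the plugin logs a debug message and disables itself, or trips the assertion when assertions are enabled. Perhaps the design is a little too strict... 🤣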

Following the source further, the key parts of elasticsearch/server/src/main/java/org/elasticsearch/Build.java are:

...
    public enum Type {

        DEB("deb"),
        DOCKER("docker"),
        RPM("rpm"),
        TAR("tar"),
        ZIP("zip"),
        UNKNOWN("unknown");

        final String displayName;

        public String displayName() {
            return displayName;
        }

        Type(final String displayName) {
            this.displayName = displayName;
        }

        public static Type fromDisplayName(final String displayName, final boolean strict) {
            switch (displayName) {
                case "deb":
                    return Type.DEB;
                case "docker":
                    return Type.DOCKER;
                case "rpm":
                    return Type.RPM;
                case "tar":
                    return Type.TAR;
                case "zip":
                    return Type.ZIP;
                case "unknown":
                    return Type.UNKNOWN;
                default:
                    if (strict) {
                        throw new IllegalStateException("unexpected distribution type [" + displayName + "]; your distribution is broken");
                    } else {
                        return Type.UNKNOWN;
                    }
            }
        }

    }
    static {
        final Flavor flavor;
        final Type type;
        final String hash;
        final String date;
        final boolean isSnapshot;
        final String version;

        // these are parsed at startup, and we require that we are able to recognize the values passed in by the startup scripts
        flavor = Flavor.fromDisplayName(System.getProperty("es.distribution.flavor", "unknown"), true);
        // The distribution type is read from the system property es.distribution.type
        type = Type.fromDisplayName(System.getProperty("es.distribution.type", "unknown"), true);

...

        CURRENT = new Build(flavor, type, hash, date, isSnapshot, version);
    }
...
    public Type type() {
        return type;
    }
...

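The code above shows that Build.CURRENT.type() returns a Type instance derived from es.distribution.type. That value is read with System.getProperty, so it must be supplied to the JVM as a system property (a -D flag) at startup.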
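To check which type a running node actually ended up with, the 7.x root endpoint (GET /) reports it as version.build_type, alongside build_flavor. This is a minimal sketch that extracts that field from the JSON response. The helper name build_type and the localhost:9200 address are assumptions, and with security enabled curl needs credentials (for example -u).

```shell
# Read the JSON returned by GET / on stdin and print version.build_type.
build_type() {
  sed -n 's/.*"build_type"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example (address is an assumption):
#   curl -s http://localhost:9200 | build_type      # "tar" before the fix, "rpm" after
# The JVM flag itself is also visible in the process list:
#   ps -ef | grep -o 'es.distribution.type=[a-z]*'
```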

The next part tracks down where es.distribution.type is set.

4.2 The trick hidden in elasticsearch-env

Search the whole rpm install directory:

[root@elk elasticsearch]# grep -Rn "es.distribution.type" *
bin/elasticsearch:66:    -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
bin/elasticsearch:79:    -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
bin/elasticsearch-cli:30:  -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \

Line 66 of bin/elasticsearch, in context:

...
     58 if [[ $DAEMONIZE = false ]]; then
     59   exec \
     60     "$JAVA" \
     61     "$XSHARE" \
     62     $ES_JAVA_OPTS \
     63     -Des.path.home="$ES_HOME" \
     64     -Des.path.conf="$ES_PATH_CONF" \
     65     -Des.distribution.flavor="$ES_DISTRIBUTION_FLAVOR" \
     66     -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
     67     -Des.bundled_jdk="$ES_BUNDLED_JDK" \
     68     -cp "$ES_CLASSPATH" \
     69     org.elasticsearch.bootstrap.Elasticsearch \
     70     "$@" <<<"$KEYSTORE_PASSWORD"
...

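The script has a daemon branch and a non-daemon branch. Because ExecStart does not pass -d, the non-daemon branch shown above is the one executed. The value of es.distribution.type comes from ES_DISTRIBUTION_TYPE, so search for that next: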

[root@elk elasticsearch]# grep -Rn ES_DISTRIBUTION_TYPE *
bin/elasticsearch:66:    -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
bin/elasticsearch:79:    -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
bin/elasticsearch-cli:30:  -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
bin/elasticsearch-env:92:ES_DISTRIBUTION_TYPE=rpm
bin/elasticsearch-env:95:if [[ "$ES_DISTRIBUTION_TYPE" == "docker" ]]; then

Surprisingly, line 92 of bin/elasticsearch-env hardcodes ES_DISTRIBUTION_TYPE=rpm. Following the scripts' execution order shows that bin/elasticsearch-env sets the variable first, and bin/elasticsearch then references it.

By contrast, the same search in the tar package gives:

[root@elk elasticsearch-7.9.1]# grep -Rn ES_DISTRIBUTION_TYPE *
bin/elasticsearch-cli:30:  -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
bin/elasticsearch:66:    -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
bin/elasticsearch:79:    -Des.distribution.type="$ES_DISTRIBUTION_TYPE" \
bin/elasticsearch-env:92:ES_DISTRIBUTION_TYPE=tar
bin/elasticsearch-env:95:if [[ "$ES_DISTRIBUTION_TYPE" == "docker" ]]; then

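In the tar package, line 92 of bin/elasticsearch-env reads ES_DISTRIBUTION_TYPE=tar. Suddenly it all makes sense (😂): each package presets a different type in its scripts, so the systemd module takes a different branch.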

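In summary, reusing the rpm's scripts together with the files extracted from the tar package requires two changes:
- add the systemd module to the modules directory;
- change tar to rpm in bin/elasticsearch-env.

Only then is the notify mechanism triggered, so that systemctl can start the service.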
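Before running systemctl start, the conditions above can be checked in one go. This is a hypothetical pre-flight helper, not part of Elasticsearch. It only inspects files, so it cannot catch other startup problems such as directory permissions or JNA issues.

```shell
# Hypothetical pre-flight check for a tar install driven by a Type=notify unit.
# Prints OK/FAIL per condition and returns non-zero if any condition fails.
check_notify_prereqs() {
  local es_home=$1 unit_file=$2 ok=0 dist_type
  if [ -d "$es_home/modules/systemd" ]; then
    echo "OK   modules/systemd present"
  else
    echo "FAIL modules/systemd missing"; ok=1
  fi
  dist_type=$(sed -n 's/^ES_DISTRIBUTION_TYPE=//p' "$es_home/bin/elasticsearch-env")
  case $dist_type in
    rpm|deb) echo "OK   ES_DISTRIBUTION_TYPE=$dist_type" ;;
    *)       echo "FAIL ES_DISTRIBUTION_TYPE=$dist_type (systemd module needs rpm or deb)"; ok=1 ;;
  esac
  if grep -qx 'Type=notify' "$unit_file" && grep -qx 'Environment=ES_SD_NOTIFY=true' "$unit_file"; then
    echo "OK   unit has Type=notify and ES_SD_NOTIFY=true"
  else
    echo "FAIL unit lacks Type=notify or ES_SD_NOTIFY=true"; ok=1
  fi
  return $ok
}

# Example (paths are assumptions):
#   check_notify_prereqs /opt/elasticsearch-7.9.1 /etc/systemd/system/elasticsearch.service
```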

4.3 Related community discussions

Several threads in the Elasticsearch community discuss this problem:
https://discuss.elastic.co/t/starting-elasticsearch-with-systemd-hangs/229510
https://github.com/elastic/elasticsearch/issues/55477
https://discuss.elastic.co/t/does-the-tar-gz-archive-include-the-systemd-module/259372
https://discuss.elastic.co/t/elasticsearch-can-not-be-started-by-service/257773
https://discuss.elastic.co/t/elasticsearch-no-longer-works-under-systemd-7-4-0-on-centos-7-7-1908/201846/3

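Those threads also mention other, smaller issues that can make systemd startup fail. In each case the Elasticsearch logs show the cause, and the fix follows from there. Typical examples are tmp directory permissions and JNA compatibility on some less common architectures, all of which can be solved through configuration.

Some workarounds simply replace the notify mechanism altogether. I don't recommend that: Elasticsearch starts relatively slowly and may still fail and exit partway through startup. Keeping the official notification mechanism is safer, and this article explains both the cause and the fix.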

I am new to the Elasticsearch stack myself, so if anything here is wrong, please point it out for everyone's benefit. Many thanks!
