
[SUPPORT] Unable to sync metadata to Hive metastore #13057

Open
@Souldiv

Description

Describe the problem you faced

I am trying to sync table metadata to the Hive metastore using the following Spark command. I have followed the config as shown here, and the following command is run:

spark-submit --class org.apache.hudi.utilities.streamer.HoodieStreamer $HUDI_UTILITIES_BUNDLE \
--table-type COPY_ON_WRITE \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field ts \
--target-base-path hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2  \
--target-table stock_ticks_cow_2  \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--hoodie-conf hoodie.streamer.schemaprovider.registry.url=http://localhost:8081/subjects/stock_ticks-value/versions/latest \
--hoodie-conf hoodie.streamer.source.kafka.topic=stock_ticks \
--hoodie-conf hoodie.datasource.write.recordkey.field=key \
--hoodie-conf hoodie.datasource.write.partitionpath.field=date \
--hoodie-conf schema.registry.url=http://localhost:8081 \
--hoodie-conf auto.offset.reset=earliest \
--hoodie-conf bootstrap.servers=localhost:9092 \
--hoodie-conf hoodie.upsert.shuffle.parallelism=2 \
--hoodie-conf hoodie.insert.shuffle.parallelism=2 \
--hoodie-conf hoodie.delete.shuffle.parallelism=2 \
--hoodie-conf hoodie.bulkinsert.shuffle.parallelism=2 \
--hoodie-conf hoodie.datasource.hive_sync.mode=hms \
--hoodie-conf hoodie.datasource.hive_sync.enable=true \
--hoodie-conf hoodie.datasource.hive_sync.metastore.uris=thrift://localhost:9083 \
--hoodie-conf hoodie.datasource.hive_sync.table=stock_ticks_cow_2 \
--hoodie-conf hoodie.datasource.meta.sync.enable=true \
--hoodie-conf hoodie.datasource.hive_sync.batch_num=10 \
--props file:///dev/null

Spark writes the table to HDFS as intended, but I don't see the table metadata in Hive through Beeline. Please let me know if I am missing any required configuration or if I have misunderstood the purpose of these configs.
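For completeness, here is the same invocation with the streamer-level sync flag added. I am not sure whether the hoodie.datasource.hive_sync.* write configs alone are supposed to trigger sync from HoodieStreamer, so the extra --enable-sync flag below is an assumption on my part, not something I have confirmed:

spark-submit --class org.apache.hudi.utilities.streamer.HoodieStreamer $HUDI_UTILITIES_BUNDLE \
--table-type COPY_ON_WRITE \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field ts \
--target-base-path hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2 \
--target-table stock_ticks_cow_2 \
--enable-sync \
--hoodie-conf hoodie.datasource.hive_sync.mode=hms \
--hoodie-conf hoodie.datasource.hive_sync.metastore.uris=thrift://localhost:9083 \
--hoodie-conf hoodie.datasource.hive_sync.table=stock_ticks_cow_2 \
--props file:///dev/null

(the remaining --hoodie-conf options are the same as in the full command above)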

To Reproduce

Steps to reproduce the behavior:

  1. Push stock data to the stock_ticks topic.
  2. Run the spark-submit command above.
  3. Check from Beeline whether the table shows up using show tables; (see the snippet below).
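
For step 3, the check itself is just the following (assuming HiveServer2 on the default port 10000 and the table landing in the default database):

# list tables in the default database; stock_ticks_cow_2 is expected here but never shows up
beeline -u jdbc:hive2://localhost:10000 -e "show tables in default;"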

Expected behavior

I was expecting the table metadata to be synced to Hive after running the Spark command with the Hive sync configuration.
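
To isolate whether the problem is in the streamer or in the Hive connection itself, I think the standalone sync tool from the same build could be run against the existing table. This is only a sketch; the script location and flag names should be checked against the tool's --help for 0.15.0:

# run_sync_tool.sh ships with the Hudi source tree (hudi-sync/hudi-hive-sync); the path below is an assumption
./hudi-sync/hudi-hive-sync/run_sync_tool.sh \
--metastore-uris thrift://localhost:9083 \
--sync-mode hms \
--base-path hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2 \
--database default \
--table stock_ticks_cow_2 \
--partitioned-by date

If this registers the table, the metastore connection is fine and the issue is in how the streamer picks up the sync configs.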

Environment Description

  • Hudi version : 0.15

  • Spark version : 3.5.5

  • Hive version : 2.3.9

  • Hadoop version : 3.4.1

  • Storage (HDFS/S3/GCS..) : HDFS

  • Running on Docker? (yes/no) : No

Stacktrace

25/03/30 17:42:33 WARN Utils: Your hostname, hudi resolves to a loopback address: 127.0.1.1; using 10.0.0.108 instead (on interface eth0)
25/03/30 17:42:33 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
25/03/30 17:42:33 WARN SchedulerConfGenerator: Job Scheduling Configs will not be in effect as spark.scheduler.mode is not set to FAIR at instantiation time. Continuing without scheduling configs
25/03/30 17:42:34 INFO SparkContext: Running Spark version 3.5.5
25/03/30 17:42:34 INFO SparkContext: OS info Linux, 6.8.4-3-pve, amd64
25/03/30 17:42:34 INFO SparkContext: Java version 1.8.0_442
25/03/30 17:42:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/03/30 17:42:34 INFO ResourceUtils: ==============================================================
25/03/30 17:42:34 INFO ResourceUtils: No custom resources configured for spark.driver.
25/03/30 17:42:34 INFO ResourceUtils: ==============================================================
25/03/30 17:42:34 INFO SparkContext: Submitted application: streamer-stock_ticks_cow_2
25/03/30 17:42:34 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
25/03/30 17:42:34 INFO ResourceProfile: Limiting resource is cpu
25/03/30 17:42:34 INFO ResourceProfileManager: Added ResourceProfile id: 0
25/03/30 17:42:34 INFO SecurityManager: Changing view acls to: conuser
25/03/30 17:42:34 INFO SecurityManager: Changing modify acls to: conuser
25/03/30 17:42:34 INFO SecurityManager: Changing view acls groups to: 
25/03/30 17:42:34 INFO SecurityManager: Changing modify acls groups to: 
25/03/30 17:42:34 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: conuser; groups with view permissions: EMPTY; users with modify permissions: conuser; groups with modify permissions: EMPTY
25/03/30 17:42:34 INFO deprecation: mapred.output.compression.codec is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.codec
25/03/30 17:42:34 INFO deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
25/03/30 17:42:34 INFO deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
25/03/30 17:42:34 INFO Utils: Successfully started service 'sparkDriver' on port 44127.
25/03/30 17:42:34 INFO SparkEnv: Registering MapOutputTracker
25/03/30 17:42:34 INFO SparkEnv: Registering BlockManagerMaster
25/03/30 17:42:34 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
25/03/30 17:42:34 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
25/03/30 17:42:34 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
25/03/30 17:42:34 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-970f83dc-4465-4290-a3dd-b6a401ed3feb
25/03/30 17:42:34 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
25/03/30 17:42:34 INFO SparkEnv: Registering OutputCommitCoordinator
25/03/30 17:42:34 INFO JettyUtils: Start Jetty 0.0.0.0:8090 for SparkUI
25/03/30 17:42:34 WARN Utils: Service 'SparkUI' could not bind on port 8090. Attempting port 8091.
25/03/30 17:42:34 INFO Utils: Successfully started service 'SparkUI' on port 8091.
25/03/30 17:42:34 INFO SparkContext: Added JAR file:/home/conuser/downloads/hudi-0.15.0/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.15.0.jar at spark://10.0.0.108:44127/jars/hudi-utilities-bundle_2.12-0.15.0.jar with timestamp 1743356554014
25/03/30 17:42:34 INFO Executor: Starting executor ID driver on host 10.0.0.108
25/03/30 17:42:34 INFO Executor: OS info Linux, 6.8.4-3-pve, amd64
25/03/30 17:42:34 INFO Executor: Java version 1.8.0_442
25/03/30 17:42:34 INFO Executor: Starting executor with user classpath (userClassPathFirst = false): ''
25/03/30 17:42:34 INFO Executor: Created or updated repl class loader org.apache.spark.util.MutableURLClassLoader@365a6a43 for default.
25/03/30 17:42:34 INFO Executor: Fetching spark://10.0.0.108:44127/jars/hudi-utilities-bundle_2.12-0.15.0.jar with timestamp 1743356554014
25/03/30 17:42:34 INFO TransportClientFactory: Successfully created connection to /10.0.0.108:44127 after 19 ms (0 ms spent in bootstraps)
25/03/30 17:42:34 INFO Utils: Fetching spark://10.0.0.108:44127/jars/hudi-utilities-bundle_2.12-0.15.0.jar to /tmp/spark-8b36c157-3895-45ce-86b2-5a063c272795/userFiles-2caada7f-5b56-4053-8db1-5b00562db47c/fetchFileTemp821209291924917814.tmp
25/03/30 17:42:34 INFO Executor: Adding file:/tmp/spark-8b36c157-3895-45ce-86b2-5a063c272795/userFiles-2caada7f-5b56-4053-8db1-5b00562db47c/hudi-utilities-bundle_2.12-0.15.0.jar to class loader default
25/03/30 17:42:34 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35865.
25/03/30 17:42:34 INFO NettyBlockTransferService: Server created on 10.0.0.108:35865
25/03/30 17:42:34 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
25/03/30 17:42:34 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.0.108, 35865, None)
25/03/30 17:42:34 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.0.108:35865 with 366.3 MiB RAM, BlockManagerId(driver, 10.0.0.108, 35865, None)
25/03/30 17:42:34 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.0.108, 35865, None)
25/03/30 17:42:34 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.0.108, 35865, None)
25/03/30 17:42:35 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
25/03/30 17:42:35 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
25/03/30 17:42:35 INFO UtilHelpers: Adding overridden properties to file properties.
25/03/30 17:42:35 INFO SharedState: spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir.
25/03/30 17:42:35 INFO SharedState: Warehouse path is 'hdfs://localhost:9000/user/hive/warehouse'.
25/03/30 17:42:35 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
25/03/30 17:42:35 INFO HoodieTableConfig: Loading table properties from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2/.hoodie/hoodie.properties
25/03/30 17:42:35 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
25/03/30 17:42:35 INFO HoodieStreamer: Creating Hudi Streamer with configs:
auto.offset.reset: earliest
bootstrap.servers: localhost:9092
hoodie.auto.adjust.lock.configs: true
hoodie.bulkinsert.shuffle.parallelism: 2
hoodie.datasource.hive_sync.batch_num: 10
hoodie.datasource.hive_sync.enable: true
hoodie.datasource.hive_sync.metastore.uris: thrift://localhost:9083
hoodie.datasource.hive_sync.mode: hms
hoodie.datasource.hive_sync.table: stock_ticks_cow_2
hoodie.datasource.meta.sync.enable: true
hoodie.datasource.write.partitionpath.field: date
hoodie.datasource.write.reconcile.schema: false
hoodie.datasource.write.recordkey.field: key
hoodie.delete.shuffle.parallelism: 2
hoodie.insert.shuffle.parallelism: 2
hoodie.streamer.schemaprovider.registry.url: http://localhost:8081/subjects/stock_ticks-value/versions/latest
hoodie.streamer.source.kafka.topic: stock_ticks
hoodie.upsert.shuffle.parallelism: 2
schema.registry.url: http://localhost:8081

25/03/30 17:42:35 INFO HoodieSparkKeyGeneratorFactory: The value of hoodie.datasource.write.keygenerator.type is empty; inferred to be SIMPLE
25/03/30 17:42:35 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
25/03/30 17:42:35 INFO HoodieTableConfig: Loading table properties from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2/.hoodie/hoodie.properties
25/03/30 17:42:35 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
25/03/30 17:42:35 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20250330173718165__commit__COMPLETED__20250330173723152]}
25/03/30 17:42:35 INFO HoodieIngestionService: Ingestion service starts running in run-once mode
25/03/30 17:42:35 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
25/03/30 17:42:35 INFO HoodieTableConfig: Loading table properties from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2/.hoodie/hoodie.properties
25/03/30 17:42:35 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
25/03/30 17:42:35 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20250330173718165__commit__COMPLETED__20250330173723152]}
25/03/30 17:42:35 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
25/03/30 17:42:35 INFO HoodieTableConfig: Loading table properties from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2/.hoodie/hoodie.properties
25/03/30 17:42:35 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
25/03/30 17:42:36 INFO StreamSync: Checkpoint to resume from : Option{val=stock_ticks,0:3482}
25/03/30 17:42:36 INFO KafkaOffsetGen: SourceLimit not configured, set numEvents to default value : 5000000
25/03/30 17:42:36 INFO KafkaOffsetGen: getNextOffsetRanges set config hoodie.streamer.source.kafka.minPartitions to 0
25/03/30 17:42:36 INFO ConsumerConfig: ConsumerConfig values: 
	allow.auto.create.topics = true
	auto.commit.interval.ms = 5000
	auto.offset.reset = earliest
	bootstrap.servers = [localhost:9092]
	check.crcs = true
	client.dns.lookup = use_all_dns_ips
	client.id = consumer-null-1
	client.rack = 
	connections.max.idle.ms = 540000
	default.api.timeout.ms = 60000
	enable.auto.commit = true
	exclude.internal.topics = true
	fetch.max.bytes = 52428800
	fetch.max.wait.ms = 500
	fetch.min.bytes = 1
	group.id = null
	group.instance.id = null
	heartbeat.interval.ms = 3000
	interceptor.classes = []
	internal.leave.group.on.close = true
	internal.throw.on.fetch.stable.offset.unsupported = false
	isolation.level = read_uncommitted
	key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
	max.partition.fetch.bytes = 1048576
	max.poll.interval.ms = 300000
	max.poll.records = 500
	metadata.max.age.ms = 300000
	metric.reporters = []
	metrics.num.samples = 2
	metrics.recording.level = INFO
	metrics.sample.window.ms = 30000
	partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
	receive.buffer.bytes = 65536
	reconnect.backoff.max.ms = 1000
	reconnect.backoff.ms = 50
	request.timeout.ms = 30000
	retry.backoff.ms = 100
	sasl.client.callback.handler.class = null
	sasl.jaas.config = null
	sasl.kerberos.kinit.cmd = /usr/bin/kinit
	sasl.kerberos.min.time.before.relogin = 60000
	sasl.kerberos.service.name = null
	sasl.kerberos.ticket.renew.jitter = 0.05
	sasl.kerberos.ticket.renew.window.factor = 0.8
	sasl.login.callback.handler.class = null
	sasl.login.class = null
	sasl.login.refresh.buffer.seconds = 300
	sasl.login.refresh.min.period.seconds = 60
	sasl.login.refresh.window.factor = 0.8
	sasl.login.refresh.window.jitter = 0.05
	sasl.mechanism = GSSAPI
	security.protocol = PLAINTEXT
	security.providers = null
	send.buffer.bytes = 131072
	session.timeout.ms = 10000
	socket.connection.setup.timeout.max.ms = 30000
	socket.connection.setup.timeout.ms = 10000
	ssl.cipher.suites = null
	ssl.enabled.protocols = [TLSv1.2]
	ssl.endpoint.identification.algorithm = https
	ssl.engine.factory.class = null
	ssl.key.password = null
	ssl.keymanager.algorithm = SunX509
	ssl.keystore.certificate.chain = null
	ssl.keystore.key = null
	ssl.keystore.location = null
	ssl.keystore.password = null
	ssl.keystore.type = JKS
	ssl.protocol = TLSv1.2
	ssl.provider = null
	ssl.secure.random.implementation = null
	ssl.trustmanager.algorithm = PKIX
	ssl.truststore.certificates = null
	ssl.truststore.location = null
	ssl.truststore.password = null
	ssl.truststore.type = JKS
	value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer

25/03/30 17:42:36 WARN ConsumerConfig: The configuration 'schema.registry.url' was supplied but isn't a known config.
25/03/30 17:42:36 INFO AppInfoParser: Kafka version: 2.8.0
25/03/30 17:42:36 INFO AppInfoParser: Kafka commitId: ebb1d6e21cc92130
25/03/30 17:42:36 INFO AppInfoParser: Kafka startTimeMs: 1743356556089
25/03/30 17:42:36 INFO Metadata: [Consumer clientId=consumer-null-1, groupId=null] Cluster ID: Nk-xOeixRZGj41miDeXdjQ
25/03/30 17:42:36 INFO Metrics: Metrics scheduler closed
25/03/30 17:42:36 INFO Metrics: Closing reporter org.apache.kafka.common.metrics.JmxReporter
25/03/30 17:42:36 INFO Metrics: Metrics reporters closed
25/03/30 17:42:36 INFO AppInfoParser: App info kafka.consumer for consumer-null-1 unregistered
25/03/30 17:42:36 INFO KafkaOffsetGen: final ranges [OffsetRange(topic: 'stock_ticks', partition: 0, range: [3482 -> 3482])]
25/03/30 17:42:36 INFO KafkaSource: About to read sourceLimit 9223372036854775807 in 0 spark partitions from kafka for topic stock_ticks with offset ranges [OffsetRange(topic: 'stock_ticks', partition: 0, range: [3482 -> 3482])]
25/03/30 17:42:36 INFO KafkaSource: About to read 0 from Kafka for topic :stock_ticks
25/03/30 17:42:36 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20250330173718165__commit__COMPLETED__20250330173723152]}
25/03/30 17:42:36 INFO UtilHelpers: Adding overridden properties to file properties.
25/03/30 17:42:36 INFO StreamSync: No new data, source checkpoint has not changed. Nothing to commit. Old checkpoint=(Option{val=stock_ticks,0:3482}). New Checkpoint=(stock_ticks,0:3482)
25/03/30 17:42:36 INFO StreamSync: Shutting down embedded timeline server
25/03/30 17:42:36 INFO HoodieIngestionService: Ingestion service (run-once mode) has been shut down.
25/03/30 17:42:36 INFO SparkContext: SparkContext is stopping with exitCode 0.
25/03/30 17:42:36 INFO SparkUI: Stopped Spark web UI at http://10.0.0.108:8091
25/03/30 17:42:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
25/03/30 17:42:36 INFO MemoryStore: MemoryStore cleared
25/03/30 17:42:36 INFO BlockManager: BlockManager stopped
25/03/30 17:42:36 INFO BlockManagerMaster: BlockManagerMaster stopped
25/03/30 17:42:36 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
25/03/30 17:42:36 INFO SparkContext: Successfully stopped SparkContext
25/03/30 17:42:36 INFO ShutdownHookManager: Shutdown hook called
25/03/30 17:42:36 INFO ShutdownHookManager: Deleting directory /tmp/spark-37076236-cc75-4ba3-a7bc-65a0778326a0
25/03/30 17:42:36 INFO ShutdownHookManager: Deleting directory /tmp/spark-8b36c157-3895-45ce-86b2-5a063c272795
