
[SUPPORT] unable to sync metadata to hive metastore #13057


Open
Souldiv opened this issue Mar 30, 2025 · 3 comments
Labels
hive (Issues related to hive) · hudistreamer (issues related to Hudi Streamer, formerly DeltaStreamer) · meta-sync

Comments


Souldiv commented Mar 30, 2025

Describe the problem you faced

I am trying to sync table metadata to the Hive metastore with Hudi Streamer. I followed the configuration shown here and run the following Spark command:

spark-submit --class org.apache.hudi.utilities.streamer.HoodieStreamer $HUDI_UTILITIES_BUNDLE \
--table-type COPY_ON_WRITE \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field ts \
--target-base-path hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2  \
--target-table stock_ticks_cow_2  \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--hoodie-conf hoodie.streamer.schemaprovider.registry.url=http://localhost:8081/subjects/stock_ticks-value/versions/latest \
--hoodie-conf hoodie.streamer.source.kafka.topic=stock_ticks \
--hoodie-conf hoodie.datasource.write.recordkey.field=key \
--hoodie-conf hoodie.datasource.write.partitionpath.field=date \
--hoodie-conf schema.registry.url=http://localhost:8081 \
--hoodie-conf auto.offset.reset=earliest \
--hoodie-conf bootstrap.servers=localhost:9092 \
--hoodie-conf hoodie.upsert.shuffle.parallelism=2 \
--hoodie-conf hoodie.insert.shuffle.parallelism=2 \
--hoodie-conf hoodie.delete.shuffle.parallelism=2 \
--hoodie-conf hoodie.bulkinsert.shuffle.parallelism=2 \
--hoodie-conf hoodie.datasource.hive_sync.mode=hms \
--hoodie-conf hoodie.datasource.hive_sync.enable=true \
--hoodie-conf hoodie.datasource.hive_sync.metastore.uris=thrift://localhost:9083 \
--hoodie-conf hoodie.datasource.hive_sync.table=stock_ticks_cow_2 \
--hoodie-conf hoodie.datasource.meta.sync.enable=true \
--hoodie-conf hoodie.datasource.hive_sync.batch_num=10 \
--props file:///dev/null

Spark writes the table to HDFS as intended, but I don't see the table metadata in Hive through Beeline. Please let me know if I am missing a required configuration or if I have misunderstood the purpose of these settings.

To Reproduce

Steps to reproduce the behavior:

  1. Push stock data to the stock_ticks topic.
  2. Run the spark-submit command above.
  3. Check from Beeline whether the table shows up using show tables; (see the sketch after this list).
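
A minimal Beeline check for step 3, assuming HiveServer2 is reachable at the default port 10000 on the same host (the JDBC URL is an assumption, not taken from the issue):

# connect to HiveServer2 and list tables; adjust the JDBC URL to your setup
beeline -u jdbc:hive2://localhost:10000 -e "show tables;"
# once the table is synced, its schema and storage details should be visible as well
beeline -u jdbc:hive2://localhost:10000 -e "describe formatted stock_ticks_cow_2;"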

Expected behavior

I expected the table metadata to be synced to Hive after running the Spark command with the Hive sync configuration.

Environment Description

  • Hudi version : 0.15

  • Spark version : 3.5.5

  • Hive version : 2.3.9

  • Hadoop version : 3.4.1

  • Storage (HDFS/S3/GCS..) : HDFS

  • Running on Docker? (yes/no) : No

Stacktrace

25/03/30 17:42:33 WARN Utils: Your hostname, hudi resolves to a loopback address: 127.0.1.1; using 10.0.0.108 instead (on interface eth0)
25/03/30 17:42:33 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
25/03/30 17:42:33 WARN SchedulerConfGenerator: Job Scheduling Configs will not be in effect as spark.scheduler.mode is not set to FAIR at instantiation time. Continuing without scheduling configs
25/03/30 17:42:34 INFO SparkContext: Running Spark version 3.5.5
25/03/30 17:42:34 INFO SparkContext: OS info Linux, 6.8.4-3-pve, amd64
25/03/30 17:42:34 INFO SparkContext: Java version 1.8.0_442
25/03/30 17:42:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/03/30 17:42:34 INFO ResourceUtils: ==============================================================
25/03/30 17:42:34 INFO ResourceUtils: No custom resources configured for spark.driver.
25/03/30 17:42:34 INFO ResourceUtils: ==============================================================
25/03/30 17:42:34 INFO SparkContext: Submitted application: streamer-stock_ticks_cow_2
25/03/30 17:42:34 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
25/03/30 17:42:34 INFO ResourceProfile: Limiting resource is cpu
25/03/30 17:42:34 INFO ResourceProfileManager: Added ResourceProfile id: 0
25/03/30 17:42:34 INFO SecurityManager: Changing view acls to: conuser
25/03/30 17:42:34 INFO SecurityManager: Changing modify acls to: conuser
25/03/30 17:42:34 INFO SecurityManager: Changing view acls groups to: 
25/03/30 17:42:34 INFO SecurityManager: Changing modify acls groups to: 
25/03/30 17:42:34 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: conuser; groups with view permissions: EMPTY; users with modify permissions: conuser; groups with modify permissions: EMPTY
25/03/30 17:42:34 INFO deprecation: mapred.output.compression.codec is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.codec
25/03/30 17:42:34 INFO deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
25/03/30 17:42:34 INFO deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
25/03/30 17:42:34 INFO Utils: Successfully started service 'sparkDriver' on port 44127.
25/03/30 17:42:34 INFO SparkEnv: Registering MapOutputTracker
25/03/30 17:42:34 INFO SparkEnv: Registering BlockManagerMaster
25/03/30 17:42:34 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
25/03/30 17:42:34 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
25/03/30 17:42:34 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
25/03/30 17:42:34 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-970f83dc-4465-4290-a3dd-b6a401ed3feb
25/03/30 17:42:34 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
25/03/30 17:42:34 INFO SparkEnv: Registering OutputCommitCoordinator
25/03/30 17:42:34 INFO JettyUtils: Start Jetty 0.0.0.0:8090 for SparkUI
25/03/30 17:42:34 WARN Utils: Service 'SparkUI' could not bind on port 8090. Attempting port 8091.
25/03/30 17:42:34 INFO Utils: Successfully started service 'SparkUI' on port 8091.
25/03/30 17:42:34 INFO SparkContext: Added JAR file:/home/conuser/downloads/hudi-0.15.0/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.15.0.jar at spark://10.0.0.108:44127/jars/hudi-utilities-bundle_2.12-0.15.0.jar with timestamp 1743356554014
25/03/30 17:42:34 INFO Executor: Starting executor ID driver on host 10.0.0.108
25/03/30 17:42:34 INFO Executor: OS info Linux, 6.8.4-3-pve, amd64
25/03/30 17:42:34 INFO Executor: Java version 1.8.0_442
25/03/30 17:42:34 INFO Executor: Starting executor with user classpath (userClassPathFirst = false): ''
25/03/30 17:42:34 INFO Executor: Created or updated repl class loader org.apache.spark.util.MutableURLClassLoader@365a6a43 for default.
25/03/30 17:42:34 INFO Executor: Fetching spark://10.0.0.108:44127/jars/hudi-utilities-bundle_2.12-0.15.0.jar with timestamp 1743356554014
25/03/30 17:42:34 INFO TransportClientFactory: Successfully created connection to /10.0.0.108:44127 after 19 ms (0 ms spent in bootstraps)
25/03/30 17:42:34 INFO Utils: Fetching spark://10.0.0.108:44127/jars/hudi-utilities-bundle_2.12-0.15.0.jar to /tmp/spark-8b36c157-3895-45ce-86b2-5a063c272795/userFiles-2caada7f-5b56-4053-8db1-5b00562db47c/fetchFileTemp821209291924917814.tmp
25/03/30 17:42:34 INFO Executor: Adding file:/tmp/spark-8b36c157-3895-45ce-86b2-5a063c272795/userFiles-2caada7f-5b56-4053-8db1-5b00562db47c/hudi-utilities-bundle_2.12-0.15.0.jar to class loader default
25/03/30 17:42:34 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35865.
25/03/30 17:42:34 INFO NettyBlockTransferService: Server created on 10.0.0.108:35865
25/03/30 17:42:34 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
25/03/30 17:42:34 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.0.108, 35865, None)
25/03/30 17:42:34 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.0.108:35865 with 366.3 MiB RAM, BlockManagerId(driver, 10.0.0.108, 35865, None)
25/03/30 17:42:34 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.0.108, 35865, None)
25/03/30 17:42:34 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.0.108, 35865, None)
25/03/30 17:42:35 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
25/03/30 17:42:35 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
25/03/30 17:42:35 INFO UtilHelpers: Adding overridden properties to file properties.
25/03/30 17:42:35 INFO SharedState: spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir.
25/03/30 17:42:35 INFO SharedState: Warehouse path is 'hdfs://localhost:9000/user/hive/warehouse'.
25/03/30 17:42:35 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
25/03/30 17:42:35 INFO HoodieTableConfig: Loading table properties from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2/.hoodie/hoodie.properties
25/03/30 17:42:35 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
25/03/30 17:42:35 INFO HoodieStreamer: Creating Hudi Streamer with configs:
auto.offset.reset: earliest
bootstrap.servers: localhost:9092
hoodie.auto.adjust.lock.configs: true
hoodie.bulkinsert.shuffle.parallelism: 2
hoodie.datasource.hive_sync.batch_num: 10
hoodie.datasource.hive_sync.enable: true
hoodie.datasource.hive_sync.metastore.uris: thrift://localhost:9083
hoodie.datasource.hive_sync.mode: hms
hoodie.datasource.hive_sync.table: stock_ticks_cow_2
hoodie.datasource.meta.sync.enable: true
hoodie.datasource.write.partitionpath.field: date
hoodie.datasource.write.reconcile.schema: false
hoodie.datasource.write.recordkey.field: key
hoodie.delete.shuffle.parallelism: 2
hoodie.insert.shuffle.parallelism: 2
hoodie.streamer.schemaprovider.registry.url: http://localhost:8081/subjects/stock_ticks-value/versions/latest
hoodie.streamer.source.kafka.topic: stock_ticks
hoodie.upsert.shuffle.parallelism: 2
schema.registry.url: http://localhost:8081

25/03/30 17:42:35 INFO HoodieSparkKeyGeneratorFactory: The value of hoodie.datasource.write.keygenerator.type is empty; inferred to be SIMPLE
25/03/30 17:42:35 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
25/03/30 17:42:35 INFO HoodieTableConfig: Loading table properties from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2/.hoodie/hoodie.properties
25/03/30 17:42:35 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
25/03/30 17:42:35 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20250330173718165__commit__COMPLETED__20250330173723152]}
25/03/30 17:42:35 INFO HoodieIngestionService: Ingestion service starts running in run-once mode
25/03/30 17:42:35 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
25/03/30 17:42:35 INFO HoodieTableConfig: Loading table properties from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2/.hoodie/hoodie.properties
25/03/30 17:42:35 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
25/03/30 17:42:35 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20250330173718165__commit__COMPLETED__20250330173723152]}
25/03/30 17:42:35 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
25/03/30 17:42:35 INFO HoodieTableConfig: Loading table properties from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2/.hoodie/hoodie.properties
25/03/30 17:42:35 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
25/03/30 17:42:36 INFO StreamSync: Checkpoint to resume from : Option{val=stock_ticks,0:3482}
25/03/30 17:42:36 INFO KafkaOffsetGen: SourceLimit not configured, set numEvents to default value : 5000000
25/03/30 17:42:36 INFO KafkaOffsetGen: getNextOffsetRanges set config hoodie.streamer.source.kafka.minPartitions to 0
25/03/30 17:42:36 INFO ConsumerConfig: ConsumerConfig values: 
	allow.auto.create.topics = true
	auto.commit.interval.ms = 5000
	auto.offset.reset = earliest
	bootstrap.servers = [localhost:9092]
	check.crcs = true
	client.dns.lookup = use_all_dns_ips
	client.id = consumer-null-1
	client.rack = 
	connections.max.idle.ms = 540000
	default.api.timeout.ms = 60000
	enable.auto.commit = true
	exclude.internal.topics = true
	fetch.max.bytes = 52428800
	fetch.max.wait.ms = 500
	fetch.min.bytes = 1
	group.id = null
	group.instance.id = null
	heartbeat.interval.ms = 3000
	interceptor.classes = []
	internal.leave.group.on.close = true
	internal.throw.on.fetch.stable.offset.unsupported = false
	isolation.level = read_uncommitted
	key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
	max.partition.fetch.bytes = 1048576
	max.poll.interval.ms = 300000
	max.poll.records = 500
	metadata.max.age.ms = 300000
	metric.reporters = []
	metrics.num.samples = 2
	metrics.recording.level = INFO
	metrics.sample.window.ms = 30000
	partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
	receive.buffer.bytes = 65536
	reconnect.backoff.max.ms = 1000
	reconnect.backoff.ms = 50
	request.timeout.ms = 30000
	retry.backoff.ms = 100
	sasl.client.callback.handler.class = null
	sasl.jaas.config = null
	sasl.kerberos.kinit.cmd = /usr/bin/kinit
	sasl.kerberos.min.time.before.relogin = 60000
	sasl.kerberos.service.name = null
	sasl.kerberos.ticket.renew.jitter = 0.05
	sasl.kerberos.ticket.renew.window.factor = 0.8
	sasl.login.callback.handler.class = null
	sasl.login.class = null
	sasl.login.refresh.buffer.seconds = 300
	sasl.login.refresh.min.period.seconds = 60
	sasl.login.refresh.window.factor = 0.8
	sasl.login.refresh.window.jitter = 0.05
	sasl.mechanism = GSSAPI
	security.protocol = PLAINTEXT
	security.providers = null
	send.buffer.bytes = 131072
	session.timeout.ms = 10000
	socket.connection.setup.timeout.max.ms = 30000
	socket.connection.setup.timeout.ms = 10000
	ssl.cipher.suites = null
	ssl.enabled.protocols = [TLSv1.2]
	ssl.endpoint.identification.algorithm = https
	ssl.engine.factory.class = null
	ssl.key.password = null
	ssl.keymanager.algorithm = SunX509
	ssl.keystore.certificate.chain = null
	ssl.keystore.key = null
	ssl.keystore.location = null
	ssl.keystore.password = null
	ssl.keystore.type = JKS
	ssl.protocol = TLSv1.2
	ssl.provider = null
	ssl.secure.random.implementation = null
	ssl.trustmanager.algorithm = PKIX
	ssl.truststore.certificates = null
	ssl.truststore.location = null
	ssl.truststore.password = null
	ssl.truststore.type = JKS
	value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer

25/03/30 17:42:36 WARN ConsumerConfig: The configuration 'schema.registry.url' was supplied but isn't a known config.
25/03/30 17:42:36 INFO AppInfoParser: Kafka version: 2.8.0
25/03/30 17:42:36 INFO AppInfoParser: Kafka commitId: ebb1d6e21cc92130
25/03/30 17:42:36 INFO AppInfoParser: Kafka startTimeMs: 1743356556089
25/03/30 17:42:36 INFO Metadata: [Consumer clientId=consumer-null-1, groupId=null] Cluster ID: Nk-xOeixRZGj41miDeXdjQ
25/03/30 17:42:36 INFO Metrics: Metrics scheduler closed
25/03/30 17:42:36 INFO Metrics: Closing reporter org.apache.kafka.common.metrics.JmxReporter
25/03/30 17:42:36 INFO Metrics: Metrics reporters closed
25/03/30 17:42:36 INFO AppInfoParser: App info kafka.consumer for consumer-null-1 unregistered
25/03/30 17:42:36 INFO KafkaOffsetGen: final ranges [OffsetRange(topic: 'stock_ticks', partition: 0, range: [3482 -> 3482])]
25/03/30 17:42:36 INFO KafkaSource: About to read sourceLimit 9223372036854775807 in 0 spark partitions from kafka for topic stock_ticks with offset ranges [OffsetRange(topic: 'stock_ticks', partition: 0, range: [3482 -> 3482])]
25/03/30 17:42:36 INFO KafkaSource: About to read 0 from Kafka for topic :stock_ticks
25/03/30 17:42:36 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20250330173718165__commit__COMPLETED__20250330173723152]}
25/03/30 17:42:36 INFO UtilHelpers: Adding overridden properties to file properties.
25/03/30 17:42:36 INFO StreamSync: No new data, source checkpoint has not changed. Nothing to commit. Old checkpoint=(Option{val=stock_ticks,0:3482}). New Checkpoint=(stock_ticks,0:3482)
25/03/30 17:42:36 INFO StreamSync: Shutting down embedded timeline server
25/03/30 17:42:36 INFO HoodieIngestionService: Ingestion service (run-once mode) has been shut down.
25/03/30 17:42:36 INFO SparkContext: SparkContext is stopping with exitCode 0.
25/03/30 17:42:36 INFO SparkUI: Stopped Spark web UI at http://10.0.0.108:8091
25/03/30 17:42:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
25/03/30 17:42:36 INFO MemoryStore: MemoryStore cleared
25/03/30 17:42:36 INFO BlockManager: BlockManager stopped
25/03/30 17:42:36 INFO BlockManagerMaster: BlockManagerMaster stopped
25/03/30 17:42:36 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
25/03/30 17:42:36 INFO SparkContext: Successfully stopped SparkContext
25/03/30 17:42:36 INFO ShutdownHookManager: Shutdown hook called
25/03/30 17:42:36 INFO ShutdownHookManager: Deleting directory /tmp/spark-37076236-cc75-4ba3-a7bc-65a0778326a0
25/03/30 17:42:36 INFO ShutdownHookManager: Deleting directory /tmp/spark-8b36c157-3895-45ce-86b2-5a063c272795
@danny0405 danny0405 added the meta-sync, hudistreamer (issues related to Hudi Streamer, formerly DeltaStreamer) and hive (Issues related to hive) labels Mar 31, 2025
@github-project-automation github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support Mar 31, 2025
ad1happy2go (Collaborator) commented:

@Souldiv There is no new data in the batch, so the streamer skips the commit and the meta sync:
25/03/30 17:42:36 INFO StreamSync: No new data, source checkpoint has not changed. Nothing to commit. Old checkpoint=(Option{val=stock_ticks,0:3482}). New Checkpoint=(stock_ticks,0:3482)
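
One quick way to confirm this is to publish a few new records to the topic so the checkpoint advances past 3482, then rerun the streamer. A minimal sketch, assuming the standard Kafka console producer and the broker address from the command in the issue; the JSON fields below are placeholders and must match the stock_ticks-value schema registered in the Schema Registry:

# publish one record so the source checkpoint moves (field names and values are placeholders)
echo '{"key": "GOOG_2025-03-30", "ts": 1743356556000, "date": "2025/03/30", "symbol": "GOOG", "close": 0.0}' | \
  kafka-console-producer.sh --bootstrap-server localhost:9092 --topic stock_ticks
# then rerun the spark-submit command from the issue; the new commit should trigger the Hive sync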

Souldiv (Author) commented Apr 1, 2025

@Souldiv There is no new data in the batch, so the streamer skips the commit and the meta sync:
25/03/30 17:42:36 INFO StreamSync: No new data, source checkpoint has not changed. Nothing to commit. Old checkpoint=(Option{val=stock_ticks,0:3482}). New Checkpoint=(stock_ticks,0:3482)

If there is no metadata present in Hive yet, shouldn't it sync regardless of whether there is new incoming data?

danny0405 (Contributor) commented:

There is an option to commit an empty commit.
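
A hedged sketch of what that might look like: recent HoodieStreamer versions expose a flag along the lines of --allow-commit-on-no-checkpoint-change, which lets the streamer commit (and therefore run meta sync) even when the source checkpoint has not moved; the exact flag name is an assumption here and should be verified against the 0.15 CLI help.

# check the available options first (the flag below is assumed, not confirmed by this thread)
spark-submit --class org.apache.hudi.utilities.streamer.HoodieStreamer $HUDI_UTILITIES_BUNDLE --help
# then rerun the original command with the flag appended
spark-submit --class org.apache.hudi.utilities.streamer.HoodieStreamer $HUDI_UTILITIES_BUNDLE \
  <all options from the command in the issue description> \
  --allow-commit-on-no-checkpoint-change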
