Long-Sequence Features for Generative Recommendation Models: Offline Storage

Latest recommended article published on 2026-06-19 23:05:55


This content is licensed under CC 4.0 BY-SA.


Tags

#big-data #generative-recommendation #recommendation-algorithms #feature-engineering


Examples of long-sequence features

For example, a user’s sequence could look like this:
[ Electronics, Clothing, Books, Home & Kitchen, Electronics, Books, Electronics, Sports & Outdoors, Electronics ]
The interactions could be further refined by adding the type of behavior (e.g., [Electronics:view, Clothing:click, Books:purchase, Home & Kitchen:add_to_cart, …] ).

1. Event-level features

Categorical Encoding: Convert event types (e.g., “click”, “add to cart”, “purchase”, “view”) or item categories into numerical representations using techniques like one-hot encoding or embedding methods.
Temporal Features: Extract time-based features from timestamps, such as hour of day, day of the week, month, and time elapsed since previous interaction.
Interaction-Specific Features: Capture attributes specific to each interaction, like product price, rating, duration of video watched, etc.
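
The event-level features above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `EVENT_TYPES` vocabulary and the function names are assumptions made for the example, and one-hot encoding is used where a production model would more likely look up a learned embedding.

```python
from datetime import datetime
from typing import Optional

# Hypothetical vocabulary of event types, for illustration only.
EVENT_TYPES = ["view", "click", "add_to_cart", "purchase"]


def one_hot(event_type: str) -> list:
    """One-hot encode an event type; unknown types map to an all-zero vector."""
    return [1 if event_type == name else 0 for name in EVENT_TYPES]


def temporal_features(ts: datetime, prev_ts: Optional[datetime] = None) -> dict:
    """Derive time-based features from an event timestamp."""
    return {
        "hour_of_day": ts.hour,
        "day_of_week": ts.weekday(),  # Monday == 0
        "month": ts.month,
        # Time elapsed since the previous interaction, if there was one.
        "seconds_since_prev": (
            None if prev_ts is None else (ts - prev_ts).total_seconds()
        ),
    }
```

Interaction-specific attributes such as price or watch duration are usually passed through as numeric columns next to these derived fields.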

2. Sequence-level features

Aggregation Features

Count of specific events: Number of clicks, purchases, or searches in the past week.
Average value of numerical features: Average product price of items viewed or purchased.
Time-based statistics: Maximum, minimum, or average time between consecutive interactions.
Frequency of interactions: Number of interactions per hour or day.
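
As a rough sketch of these aggregation features, the function below computes them over a trailing window. It assumes each event is a dict with `timestamp`, `action_type`, and `price` keys and that the list is already sorted by timestamp; those field names and the seven-day default are illustrative choices.

```python
from datetime import datetime, timedelta
from statistics import mean


def aggregate_features(events, now, window=timedelta(days=7)):
    """Aggregate a user's events that fall inside the trailing window."""
    recent = [e for e in events if now - window <= e["timestamp"] <= now]

    # Count of specific events (clicks, purchases, ...).
    counts = {}
    for e in recent:
        counts[e["action_type"]] = counts.get(e["action_type"], 0) + 1

    # Average value of a numerical feature.
    prices = [e["price"] for e in recent if e.get("price") is not None]

    # Time between consecutive interactions, in seconds.
    gaps = [
        (b["timestamp"] - a["timestamp"]).total_seconds()
        for a, b in zip(recent, recent[1:])
    ]

    return {
        "event_counts": counts,
        "avg_price": mean(prices) if prices else None,
        "min_gap_s": min(gaps) if gaps else None,
        "max_gap_s": max(gaps) if gaps else None,
        "avg_gap_s": mean(gaps) if gaps else None,
        # Frequency of interactions, expressed per day of the window.
        "events_per_day": len(recent) / (window.total_seconds() / 86400),
    }
```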

Session-based Features

Session length: Number of events or duration of the session.
Session activity type: Percentage of clicks, purchases, or searches within the session.
Sequence of items/events within a session: Representing the order of actions taken by the user, for example, viewing product A, then B, then adding B to the cart.
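
Session features presuppose some rule for cutting the event stream into sessions, which the text does not specify. The sketch below assumes a common heuristic, an inactivity gap of 30 minutes, and then derives the three session features listed above. The field names (`timestamp`, `action_type`, `product_id`) are illustrative.

```python
from datetime import datetime, timedelta


def split_sessions(events, gap=timedelta(minutes=30)):
    """Split time-sorted events into sessions at inactivity gaps (assumed rule)."""
    sessions = []
    for e in events:
        if sessions and e["timestamp"] - sessions[-1][-1]["timestamp"] <= gap:
            sessions[-1].append(e)  # continues the current session
        else:
            sessions.append([e])  # starts a new session
    return sessions


def session_features(session):
    """Compute length, activity mix, and the ordered action list for one session."""
    n = len(session)
    counts = {}
    for e in session:
        counts[e["action_type"]] = counts.get(e["action_type"], 0) + 1
    return {
        "length": n,
        "duration_s": (
            session[-1]["timestamp"] - session[0]["timestamp"]
        ).total_seconds(),
        "action_share": {k: v / n for k, v in counts.items()},
        "item_sequence": [(e["product_id"], e["action_type"]) for e in session],
    }
```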

Temporal Order Features

Lag features: Features from previous interactions (e.g., the last item viewed, the type of the second-to-last event). GeeksforGeeks notes that lag features are a fundamental technique for time-series data.
Positional embeddings: Add positional information to sequence embeddings to capture the order of events.
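
A small sketch of both ideas follows. `lag_feature` simply indexes backwards into the sequence. For positions, the text does not say whether the embeddings are learned or fixed; the example assumes the fixed sinusoidal encoding popularized by the Transformer architecture, which is only one of several options, and assumes an even dimension.

```python
import math


def lag_feature(seq, k, default=None):
    """Return the k-th most recent element of seq (k=1 is the last one)."""
    return seq[-k] if 1 <= k <= len(seq) else default


def sinusoidal_position(pos, dim):
    """Fixed sinusoidal encoding of a position; dim is assumed to be even.

    Even indices hold sin(pos / 10000^(i/dim)), odd indices the matching cos.
    """
    enc = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc
```

The resulting vector is typically added element-wise to the item or event embedding at the same position.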

3. User-level features

Long-Term Preference Features: Summarize user preferences over a long period:
Most frequently purchased categories: Top categories a user interacts with.
Overall spending patterns: Average purchase value, total purchases, etc.
Average interaction count: Average number of interactions per day or week.
User Embeddings
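
The long-term preference summaries can be sketched as below. The event fields (`category_id`, `action_type`, `price`, `timestamp`) follow the names used elsewhere in this article, while averaging over active days rather than calendar days is an assumption of the example. User embeddings are left out because they come from a trained model rather than a simple aggregation.

```python
from collections import Counter
from datetime import datetime


def user_level_features(events, top_k=3):
    """Summarize a user's long-term preferences from their full event history."""
    categories = Counter(e["category_id"] for e in events)
    purchases = [e["price"] for e in events if e["action_type"] == "purchase"]
    active_days = {e["timestamp"].date() for e in events}
    return {
        # Categories the user interacts with most often.
        "top_categories": [c for c, _ in categories.most_common(top_k)],
        # Overall spending patterns.
        "total_purchases": len(purchases),
        "avg_purchase_value": sum(purchases) / len(purchases) if purchases else None,
        # Average interaction count per day on which the user was active.
        "avg_interactions_per_active_day": (
            len(events) / len(active_days) if active_days else 0.0
        ),
    }
```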

4. Interaction features (between user and item/context)

User-Item Similarity: Calculate the similarity between the current item and previous items the user interacted with.
Time Since Last Interaction with Item: Capturing recency of interest in a particular item.
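
Both interaction features can be illustrated briefly. The sketch assumes items are already represented as embedding vectors (plain lists of floats here) and uses cosine similarity, which is one common choice; the text does not prescribe a similarity measure. Function and field names are hypothetical.

```python
import math
from datetime import datetime, timedelta


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def max_similarity_to_history(item_vec, history_vecs):
    """Highest similarity between the current item and any previously seen item."""
    return max((cosine(item_vec, h) for h in history_vecs), default=None)


def seconds_since_last_interaction(events, product_id, now):
    """Recency of the user's interest in one item; None if never seen."""
    times = [
        e["timestamp"]
        for e in events
        if e["product_id"] == product_id and e["timestamp"] <= now
    ]
    return (now - max(times)).total_seconds() if times else None
```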

How should long-term user behavior sequence features be stored in an offline data lake?

Schema design: see the schema that follows this list
File formats: columnar formats (Parquet or ORC)
Partitioning strategies: date-based partitioning, user ID/device ID partitioning
Data ingestion and processing: batch ingestion, data enrichment and transformation
Lifecycle management and cost optimization: retention policies
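
To make the partitioning strategies concrete, here is one possible way to build a Hive-style partition path. Partitioning directly on raw user IDs tends to produce a very large number of small partitions, so this sketch assumes the user ID is hashed into a fixed number of buckets. The bucket count, the `dt=` and `user_bucket=` directory names, and the root path in the usage note are all illustrative assumptions.

```python
import hashlib
from datetime import date


def user_bucket(user_id: str, num_buckets: int = 64) -> int:
    """Map a user ID to a stable bucket.

    hashlib is used instead of the built-in hash(), which is salted per process
    and would therefore assign different buckets on different runs.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def partition_path(root: str, event_date: date, user_id: str,
                   num_buckets: int = 64) -> str:
    """Build a path partitioned by date first, then by user bucket."""
    bucket = user_bucket(user_id, num_buckets)
    return f"{root}/dt={event_date.isoformat()}/user_bucket={bucket:03d}"
```

A call such as `partition_path("s3://example-bucket/behaviors", date(2024, 1, 1), "u123")` yields a path under `dt=2024-01-01`. A date-based retention policy can then drop whole `dt=` directories.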

Schema design:

user_id: string
name: string
gender: string
behavior_sequence: array<
    struct<
        timestamp: timestamp,
        category_id: int,
        action_type: string,
        product_id: string,
        price: double
    >
>
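
For readers who prefer to see the schema as code, the sketch below mirrors it with Python dataclasses and serializes one record to a JSON line. It illustrates only the nested array-of-struct shape. Writing actual Parquet or ORC files would require a library such as PyArrow or Spark, which is deliberately left out, and the class and function names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Behavior:
    """One element of behavior_sequence (the struct in the schema)."""
    timestamp: datetime
    category_id: int
    action_type: str
    product_id: str
    price: float


@dataclass
class UserRecord:
    """One row of the table: user attributes plus the nested sequence."""
    user_id: str
    name: str
    gender: str
    behavior_sequence: List[Behavior] = field(default_factory=list)


def to_json_line(record: UserRecord) -> str:
    """Serialize a record; timestamps are written as ISO 8601 strings."""
    return json.dumps(asdict(record), default=lambda v: v.isoformat())
```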

How can the behavior_sequence field be updated efficiently when new behavior arrives for an existing user?

Merge operations (upserts via MERGE SQL): a single MERGE statement updates existing records (appending to the behavior_sequence of a matching user_id) and inserts records for new users. The table name in the example suggests a Delta Lake table.

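-- Assumes source_data holds at most one row per user_id: Delta Lake rejects a
-- MERGE in which several source rows match the same target row.
MERGE INTO target_delta_table AS target
USING source_data AS source
ON target.user_id = source.user_id
WHEN MATCHED THEN
  -- array_append is available in Spark SQL 3.4 and later.
  UPDATE SET target.behavior_sequence = array_append(target.behavior_sequence, source.new_behavior)
WHEN NOT MATCHED THEN
  -- gender is not listed here, so it is stored as NULL for newly inserted users.
  INSERT (user_id, name, behavior_sequence)
  VALUES (source.user_id, source.name, array(source.new_behavior));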
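
The semantics of this upsert can be imitated in plain Python, which also shows one way to handle a batch containing several new behaviors for the same user: group them per user first, then apply a single update per user. This is a conceptual sketch over an in-memory dict, not how a table format executes MERGE, and the record layout (`user_id`, `name`, `behavior`) is assumed for the example.

```python
from collections import defaultdict


def upsert_behaviors(table, new_events):
    """Apply a batch of new behaviors to an in-memory 'table'.

    table:      dict mapping user_id -> {"name": ..., "behavior_sequence": [...]}
    new_events: list of {"user_id": ..., "name": ..., "behavior": {...}}
    """
    # Pre-aggregate so that each user is touched exactly once per batch.
    grouped = defaultdict(list)
    for ev in new_events:
        grouped[ev["user_id"]].append(ev)

    for user_id, evs in grouped.items():
        behaviors = sorted((ev["behavior"] for ev in evs),
                           key=lambda b: b["timestamp"])
        row = table.get(user_id)
        if row is None:
            # WHEN NOT MATCHED: insert a new row for this user.
            table[user_id] = {
                "name": evs[0].get("name"),
                "behavior_sequence": behaviors,
            }
        else:
            # WHEN MATCHED: append the new behaviors to the existing sequence.
            row["behavior_sequence"].extend(behaviors)
    return table
```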

References

Conversation log with Google