Long-Sequence Features for Generative Recommendation Models: Offline Storage

Examples of long-sequence features

For example, a user’s sequence of interacted item categories could look like this:
[ Electronics, Clothing, Books, Home & Kitchen, Electronics, Books, Electronics, Sports & Outdoors, Electronics ]
Each interaction can be refined further by attaching the type of behavior (e.g., [Electronics:view, Clothing:click, Books:purchase, Home & Kitchen:add_to_cart, …]).

1. Event-level features

Categorical Encoding: Convert event types (e.g., “click”, “add to cart”, “purchase”, “view”) or item categories into numerical representations using techniques like one-hot encoding or embedding methods.
Temporal Features: Extract time-based features from timestamps, such as hour of day, day of the week, month, and time elapsed since previous interaction.
Interaction-Specific Features: Capture attributes specific to each interaction, like product price, rating, duration of video watched, etc.
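
A minimal sketch of the three kinds of event-level features above, using only the Python standard library. The sample events, the `ACTIONS` vocabulary and the function names are assumptions made for illustration, not part of any particular pipeline.

```python
from datetime import datetime

# Hypothetical raw events: (timestamp, category, action, price).
events = [
    (datetime(2024, 5, 6, 9, 15), "Electronics", "view", 199.0),
    (datetime(2024, 5, 6, 9, 20), "Clothing", "click", 35.5),
    (datetime(2024, 5, 7, 21, 5), "Books", "purchase", 12.0),
]

# Assumed, fixed action vocabulary used for the categorical encoding.
ACTIONS = ["view", "click", "add_to_cart", "purchase"]
ACTION_TO_ID = {name: i for i, name in enumerate(ACTIONS)}


def one_hot(action):
    """Return a one-hot vector over the fixed action vocabulary."""
    vec = [0] * len(ACTIONS)
    vec[ACTION_TO_ID[action]] = 1
    return vec


def event_features(events):
    """Build one feature dict per event, in chronological order."""
    rows = []
    prev_ts = None
    for ts, category, action, price in sorted(events, key=lambda e: e[0]):
        rows.append({
            "category": category,
            "action_id": ACTION_TO_ID[action],      # integer id, e.g. for an embedding lookup
            "action_one_hot": one_hot(action),
            "hour_of_day": ts.hour,
            "day_of_week": ts.weekday(),            # Monday == 0
            "month": ts.month,
            # Seconds since the previous event; None for the first event.
            "secs_since_prev": None if prev_ts is None
            else (ts - prev_ts).total_seconds(),
            "price": price,                          # interaction-specific attribute
        })
        prev_ts = ts
    return rows


features = event_features(events)
```

An integer `action_id` is usually what an embedding layer consumes, while the one-hot vector suits models without learned embeddings.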

2. Sequence-level features

Aggregation Features

Count of specific events: Number of clicks, purchases, or searches in the past week.
Average value of numerical features: Average product price of items viewed or purchased.
Time-based statistics: Maximum, minimum, or average time between consecutive interactions.
Frequency of interactions: Number of interactions per hour or day.
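
The aggregation features above can be sketched roughly as follows. The sample events, the cut-off time `now` and the 7-day window are assumptions; a real job would more likely compute these in Spark or SQL over the stored sequences.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical events: (timestamp, action, price), already sorted by time.
events = [
    (datetime(2024, 5, 1, 10, 0), "view", 20.0),
    (datetime(2024, 5, 6, 9, 0), "click", 30.0),
    (datetime(2024, 5, 6, 9, 30), "purchase", 30.0),
    (datetime(2024, 5, 7, 9, 30), "view", 100.0),
]
now = datetime(2024, 5, 8, 0, 0)  # assumed feature cut-off time


def aggregate(events, now, window=timedelta(days=7)):
    """Aggregate events that fall in the half-open window [now - window, now)."""
    recent = [e for e in events if now - window <= e[0] < now]

    # Count of each event type in the window.
    counts = {}
    for _, action, _ in recent:
        counts[action] = counts.get(action, 0) + 1

    # Gaps in seconds between consecutive interactions.
    gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(recent, recent[1:])]
    window_days = window.total_seconds() / 86400

    return {
        "counts": counts,
        "avg_price": mean(p for _, _, p in recent) if recent else None,
        "min_gap_s": min(gaps) if gaps else None,
        "max_gap_s": max(gaps) if gaps else None,
        "avg_gap_s": mean(gaps) if gaps else None,
        "events_per_day": len(recent) / window_days,
    }


agg = aggregate(events, now)
```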

Session-based Features

Session length: Number of events or duration of the session.
Session activity type: Percentage of clicks, purchases, or searches within the session.
Sequence of items/events within a session: Representing the order of actions taken by the user, for example, viewing product A, then B, then adding B to the cart.
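
A sketch of session-based features. The text does not define what a session is, so this example assumes a common heuristic: a new session starts after 30 minutes of inactivity. `SESSION_GAP`, the sample events and the function names are all hypothetical.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity threshold

# Hypothetical events: (timestamp, item, action).
events = [
    (datetime(2024, 5, 6, 9, 0), "A", "view"),
    (datetime(2024, 5, 6, 9, 5), "B", "view"),
    (datetime(2024, 5, 6, 9, 7), "B", "add_to_cart"),
    (datetime(2024, 5, 6, 14, 0), "C", "view"),
]


def split_sessions(events, gap=SESSION_GAP):
    """Group time-ordered events into sessions separated by more than `gap`."""
    sessions = []
    for event in sorted(events, key=lambda e: e[0]):
        last_ts = sessions[-1][-1][0] if sessions else None
        if last_ts is not None and event[0] - last_ts <= gap:
            sessions[-1].append(event)
        else:
            sessions.append([event])
    return sessions


def session_features(session):
    """Length, duration, action mix and ordered steps of one session."""
    n = len(session)
    action_counts = {}
    for _, _, action in session:
        action_counts[action] = action_counts.get(action, 0) + 1
    return {
        "num_events": n,
        "duration_s": (session[-1][0] - session[0][0]).total_seconds(),
        "action_share": {a: c / n for a, c in action_counts.items()},
        "ordered_steps": [(item, action) for _, item, action in session],
    }


sessions = split_sessions(events)
session_stats = [session_features(s) for s in sessions]
```

With the sample data, the first session reproduces the example from the text: view A, view B, then add B to the cart.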

Temporal Order Features

Lag features: Features from previous interactions (e.g., the last item viewed, the type of the second-to-last event). GeeksforGeeks notes that lag features are a fundamental technique for time-series data.
Positional embeddings: Add positional information to sequence embeddings to capture the order of events.
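
A sketch of both temporal-order features. For the positional part it uses the fixed sinusoidal encoding from the Transformer architecture, which is only one possible choice; learned positional embeddings are another. The sample `history` and all function names are assumptions.

```python
import math

# Hypothetical history, oldest first: (item category, action).
history = [("Electronics", "view"), ("Books", "click"), ("Electronics", "purchase")]


def lag_feature(history, lag, field):
    """Value of `field` (0 = category, 1 = action) `lag` steps back.

    lag=1 is the most recent event. Returns None if the history is too short.
    """
    if lag < 1 or lag > len(history):
        return None
    return history[-lag][field]


def sinusoidal_position(pos, dim):
    """Sinusoidal encoding of position `pos`; `dim` is assumed to be even."""
    vec = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec


def add_positions(embeddings):
    """Add the positional encoding to each event embedding, element-wise."""
    dim = len(embeddings[0])
    return [
        [e + p for e, p in zip(emb, sinusoidal_position(pos, dim))]
        for pos, emb in enumerate(embeddings)
    ]


last_category = lag_feature(history, 1, 0)        # last item category
second_last_action = lag_feature(history, 2, 1)   # type of the second-to-last event
```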

3. User-level features

Long-Term Preference Features: Summaries of a user’s preferences over a long period, for example:
Most frequent categories: Top categories the user purchases from or interacts with.
Overall spending patterns: Average purchase value, total purchases, etc.
Average interaction count: Average number of interactions per day or week.
User Embeddings: Dense vectors learned from the user’s interaction history that summarize long-term preferences.
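
A sketch of the long-term preference features, computed over a user's whole history. The sample events and `user_profile` are hypothetical, and "per day" is assumed here to mean per calendar day between the first and last event. User embeddings are left out because they come from a trained model.

```python
from collections import Counter
from datetime import date

# Hypothetical events: (day, category, action, price).
events = [
    (date(2024, 5, 1), "Electronics", "view", 200.0),
    (date(2024, 5, 1), "Electronics", "purchase", 200.0),
    (date(2024, 5, 2), "Books", "purchase", 20.0),
    (date(2024, 5, 3), "Electronics", "view", 150.0),
    (date(2024, 5, 4), "Books", "view", 15.0),
    (date(2024, 5, 4), "Sports & Outdoors", "click", 60.0),
]


def user_profile(events, top_k=2):
    """Summarize long-term preferences from a non-empty event list."""
    category_counts = Counter(category for _, category, _, _ in events)
    purchases = [price for _, _, action, price in events if action == "purchase"]
    days = [day for day, _, _, _ in events]
    span_days = (max(days) - min(days)).days + 1  # inclusive calendar span

    return {
        "top_categories": [c for c, _ in category_counts.most_common(top_k)],
        "total_purchases": len(purchases),
        "total_spend": sum(purchases),
        "avg_purchase_value": sum(purchases) / len(purchases) if purchases else None,
        "interactions_per_day": len(events) / span_days,
    }


profile = user_profile(events)
```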

4. Interaction features (between user and item/context)

User-Item Similarity: Calculate the similarity between the current item and previous items the user interacted with.
Time Since Last Interaction with Item: Capturing recency of interest in a particular item.
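
A sketch of both interaction features. The text does not say how similarity is computed, so this assumes cosine similarity between the candidate item's vector and the mean vector of the items in the history. The item vectors in `item_vecs` are made-up placeholders for real embeddings.

```python
import math
from datetime import datetime


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical item embeddings and interaction history (timestamp, item).
item_vecs = {"phone": [1.0, 0.0], "case": [0.8, 0.6], "novel": [0.0, 1.0]}
history = [
    (datetime(2024, 5, 1, 12, 0), "phone"),
    (datetime(2024, 5, 3, 12, 0), "novel"),
    (datetime(2024, 5, 5, 12, 0), "phone"),
]


def user_item_similarity(candidate, history, item_vecs):
    """Cosine similarity between the candidate and the mean history vector."""
    vecs = [item_vecs[item] for _, item in history]
    mean_vec = [sum(col) / len(vecs) for col in zip(*vecs)]
    return cosine(item_vecs[candidate], mean_vec)


def secs_since_last(candidate, history, now):
    """Seconds since the user last interacted with `candidate`; None if never."""
    times = [ts for ts, item in history if item == candidate]
    return (now - max(times)).total_seconds() if times else None


now = datetime(2024, 5, 6, 12, 0)
similarity = user_item_similarity("case", history, item_vecs)
recency = secs_since_last("phone", history, now)
```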

How to store long-term user behavior sequence features in offline data lake storage

  1. Schema design: see the schema below
  2. File formats: columnar formats (Parquet or ORC)
  3. Partitioning strategies: date-based partitioning, user ID/device ID partitioning
  4. Data ingestion and processing: batch ingestion, data enrichment and transformation
  5. Lifecycle management and cost optimization: retention policies
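
To make items 3 and 5 concrete, here is a sketch of a Hive-style partition layout and a retention check. It assumes the user ID is hashed into a fixed number of buckets, because partitioning on the raw user ID would likely create far too many small partitions. The root path, `NUM_BUCKETS` and `RETENTION_DAYS` are hypothetical values.

```python
import hashlib
from datetime import date, timedelta

NUM_BUCKETS = 64       # assumed number of user-ID hash buckets
RETENTION_DAYS = 365   # assumed retention policy


def user_bucket(user_id, num_buckets=NUM_BUCKETS):
    """Stable bucket for a user ID (the built-in hash() is salted per process)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def partition_path(root, event_date, user_id):
    """Hive-style path: date partition first, then the user bucket."""
    return f"{root}/dt={event_date.isoformat()}/bucket={user_bucket(user_id):02d}"


def expired_partitions(partition_dates, today, retention_days=RETENTION_DAYS):
    """Date partitions older than the retention window, oldest first."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_dates if d < cutoff)


path = partition_path("s3://example-lake/user_behavior", date(2024, 5, 6), "u1")
to_drop = expired_partitions(
    [date(2023, 1, 1), date(2024, 5, 1), date(2022, 12, 31)],
    today=date(2024, 5, 6),
)
```

Putting the date first keeps retention cheap: dropping old data means deleting whole `dt=` directories.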

Schema design:

user_id: string
name: string
gender: string
behavior_sequence: array<
    struct<
        timestamp: timestamp,
        category_id: int,
        action_type: string,
        product_id: string,
        price: double
    >
>
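
For code that builds rows before writing them, the schema can be mirrored with Python dataclasses. This is only an illustrative sketch: the class names `Behavior` and `UserRecord` and the sample values are assumptions, and the types are the nearest Python equivalents of the column types above.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Behavior:
    """One element of behavior_sequence (the struct in the schema)."""
    timestamp: datetime
    category_id: int
    action_type: str
    product_id: str
    price: float


@dataclass
class UserRecord:
    """One row of the table: user attributes plus the behavior array."""
    user_id: str
    name: str
    gender: str
    behavior_sequence: List[Behavior] = field(default_factory=list)


record = UserRecord(
    user_id="u1",
    name="Alice",
    gender="F",
    behavior_sequence=[Behavior(datetime(2024, 5, 6, 9, 0), 3, "view", "p42", 199.0)],
)
row = asdict(record)  # nested dicts/lists, ready to hand to a writer
```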

How to update the behavior_sequence field efficiently when a new behavior arrives for an existing user?

Merge operations (upserts/MERGE SQL): A single MERGE statement updates existing records (matched on user_id, appending to behavior_sequence) and inserts records for users that do not exist yet.

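-- Assumes source_data holds at most one row per user_id, and that new_behavior is a
-- struct with the same fields as the elements of behavior_sequence.
-- array_append must be provided by the engine (Spark SQL has it from version 3.4).
MERGE INTO target_delta_table AS target
USING source_data AS source
ON target.user_id = source.user_id
WHEN MATCHED THEN
  UPDATE SET target.behavior_sequence = array_append(target.behavior_sequence, source.new_behavior)
WHEN NOT MATCHED THEN
  -- Columns omitted from this list, such as gender, are left unset for new users.
  INSERT (user_id, name, behavior_sequence)
  VALUES (source.user_id, source.name, array(source.new_behavior));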
References

Transcript of a conversation with Google
