This post outlines the technical requirements for integrating your data for our initial sync and ongoing operations. It provides a detailed overview of our data science process and the rationale behind the required formats and characteristics that are essential for ensuring data quality and model accuracy.
Our goal is to help you optimize your advertising campaigns across ad platforms using predictive Lifetime Value (pLTV) signals, aiming to improve user acquisition and return on ad spend.
To achieve accurate and actionable pLTV predictions, we require specific data from you. This document serves as a technical guide specifying our data requirements.
Our process transforms raw user data into pLTV predictions. The high-level steps are:
1. Initial Sync: We receive historical raw data from you for the first time. The onboarding guide provides a simple step-by-step walkthrough for this process.
2. Ongoing Sync (incremental updates): We receive recent data in incremental batches to keep the models current.
3. Model Training and Inference: We use historical data and extracted features to train predictive pLTV models, then apply the models to new, incoming users. A key focus is generating predictions within the first few hours of a user’s journey (see the sketch after these steps).
4. Signal Transmission: We send pLTV predictions to your connected ad platforms.
5. Campaign Performance and Attribution: We evaluate the impact of pLTV signals on campaign success. Attribution data is required to measure the effectiveness of predictions and align results with your existing attribution frameworks.
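To make steps 2 and 3 more concrete, here is a minimal, hypothetical sketch (not our production pipeline) of how a timestamped event log can yield early-window features and a revenue-based training target. The column names, the 3-hour feature window, and the 90-day target horizon are illustrative assumptions.

```python
# Hypothetical sketch: deriving early-window features and a revenue-based LTV
# target from a raw event log. Column names and windows are illustrative only.
from datetime import datetime, timedelta

events = [
    {"user_id": "u1", "event_ts": datetime(2024, 5, 1, 9, 0),  "event_name": "app_open",     "revenue": 0.0},
    {"user_id": "u1", "event_ts": datetime(2024, 5, 1, 9, 30), "event_name": "trial_signup", "revenue": 0.0},
    {"user_id": "u1", "event_ts": datetime(2024, 5, 20, 8, 0), "event_name": "purchase",     "revenue": 29.99},
]

FEATURE_WINDOW = timedelta(hours=3)   # "first few hours of a user's journey"
TARGET_HORIZON = timedelta(days=90)   # illustrative LTV horizon

def build_training_example(user_events):
    """Return (features, target) for one user from their raw event log."""
    user_events = sorted(user_events, key=lambda e: e["event_ts"])
    first_seen = user_events[0]["event_ts"]

    # Features may only use events observable within the early window (no future leakage).
    early = [e for e in user_events if e["event_ts"] <= first_seen + FEATURE_WINDOW]
    features = {
        "events_in_window": len(early),
        "signed_up_for_trial": any(e["event_name"] == "trial_signup" for e in early),
    }

    # Target: revenue accumulated over the full horizon, known only for historical users.
    target = sum(e["revenue"] for e in user_events if e["event_ts"] <= first_seen + TARGET_HORIZON)
    return features, target

print(build_training_example(events))
# ({'events_in_window': 2, 'signed_up_for_trial': True}, 29.99)
```

The important property is that features are restricted to what was observable early in the user's journey, while the target uses the longer horizon that is only known for historical users.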
For effective pLTV modeling and accurate campaign evaluation, we need access to key data sources: user activity events, user attributes, user identifiers, and campaign attribution data. Without these inputs, we cannot build or maintain reliable models. The data we require is grouped as follows.
User activity events:
- Key business interactions, including non-revenue events such as trial sign-ups, trial churns, support interactions, and membership enrollments.
- Event-level actions performed by users, such as adding to cart, completing a game level, signing up for a newsletter, and similar activities.
- Event-level telemetry data, including sessions, app opens, page views, and similar interactions.
Important notes:
- These should come in the shape of a timestamped event log (e.g., Firebase/GA4 logs) with, at minimum, an event timestamp, an event type/name, a user ID, and a revenue column for revenue-carrying events (illustrative rows follow this overview). Columns that add more context to events are also recommended.
- More is better: we generally want all available information about each user.
- User activity is used to generate features for modeling.
- Revenue events are used to define targets for model training and to evaluate campaign performance.
- Revenue events can come as columns in the event log or in a separate table.
User attributes:
- Used to generate user features for predictive Lifetime Value (pLTV) modeling.
- Also used to build models for specific platforms, regions, and similar segments.
User identifiers:
- Critical for linking user behavior with marketing activities and ensuring accurate attribution.
- The specific identifiers we need depend on the advertising platform or MMP used for the experiment. Common identifiers include device identifiers for apps (IDFA/GAID), emails, location-based identifiers (IP address, country, postal code), and MMP-specific identifiers (AppsFlyer ID, Singular ID…).
- PII such as emails and phone numbers should be hashed according to our guides, usually by creating a view and applying SHA-256; a hashing sketch follows below.
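As a rough illustration of that convention, the sketch below hashes an email with SHA-256 in Python; the normalization (trimming and lowercasing) is a common ad-platform convention and an assumption here, so always follow our hashing guides for the exact rules.

```python
# Minimal sketch of SHA-256 hashing for PII fields such as emails.
# The trim/lowercase normalization is an assumed convention, not the official guide.
import hashlib

def hash_pii(value: str) -> str:
    """Normalize a PII value and return its SHA-256 hex digest."""
    normalized = value.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

print(hash_pii("  Jane.Doe@Example.com "))
# Prints the same 64-character hex digest for any casing/whitespace variant.
```

In practice this logic typically lives in a warehouse view over the raw table, so plaintext values never leave your environment.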
Attribution data:
- Provides the source of truth for user campaign attribution.
- Links users to specific marketing efforts via MMP (Mobile Measurement Partner) tags, UTM parameters, or ad network identifiers.
- Essential for accurately evaluating campaign performance.
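To make the expected shape concrete, here are two illustrative event-log rows; everything beyond the minimum columns (a user identifier, an event timestamp, an event name, and revenue for revenue-carrying events) is a hypothetical example of useful context.

```python
# Illustrative rows for a timestamped event log. Column names are hypothetical;
# your warehouse may use different names as long as the same information is present.
example_event_log = [
    {
        "user_id": "u_123",                          # consistent identifier across tables
        "event_timestamp": "2024-05-01T09:00:00Z",   # UTC
        "event_name": "app_open",
        "revenue": None,                             # only populated for revenue events
        "platform": "ios",                           # optional context columns are welcome
        "country": "US",
    },
    {
        "user_id": "u_123",
        "event_timestamp": "2024-05-03T18:24:10Z",
        "event_name": "purchase",
        "revenue": 9.99,
        "currency": "USD",
        "platform": "ios",
        "country": "US",
    },
]
```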
Depending on your data warehouse structure, this can look different: all of this information could live in a single event log table or come from multiple sources across multiple tables. This is not a format requirement, but an overview of all the information we need.
We prefer the data in its rawest form, straight from the source, without heavy downstream transformations. This way we can ensure there are no row mutations or information leakage from the future, which could negatively impact our models.
It is important that the following information is included:
- Identifiers: Unique identifiers are crucial for mapping between tables and tracking users. These identifiers should also be recognizable by ad platforms.
- Tags: Tags help us categorize and understand user behavior.
- User Data: Comprehensive data on user interactions, purchases, and engagement is essential.
- Attribution Data: Data that shows which campaigns led to conversions and user actions.
- Payment Data: Key user data that includes revenue-generating events.
- Daily data refresh (minimum): Data must be updated at least daily. While real-time updates are not strictly required, a maximum latency of 24 hours for event and transaction data updates is necessary to maintain model relevance. The data files should be partitioned by date.
- Update Mechanism: Data should be provided via incremental updates (append-only) rather than full table refreshes. This is crucial for processing efficiency and maintaining historical fidelity. It is also important that existing data points are never mutated; any changes should be appended as new records (see the sketch after this list).
- Data History: To ensure good model accuracy, at least 3 months of historical data is required; 12 months is optimal.
- Data Format: As we take care of the transformation, we require raw, source-level data with minimal pre-processing to prevent data mutations and potential leakage during the training process.
- Timestamps: All event and transaction timestamp fields must be provided in UTC (Coordinated Universal Time) format. This prevents issues related to time zones and daylight saving when processing time-series data across different geographical locations.
- User Identification Consistency: A single, consistent identifier for each user (user ID) must be used across all provided tables (event tables, payments, attribution, and user mapping). If different identifiers exist, a user mapping table (or equivalent logic) must provide a clear and reliable link to the consistent user ID shared with us.
- Table Partitioning: For query performance, tables must be partitioned by date. This aligns with the daily incremental update requirement.
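As a sketch of what a daily, append-only, date-partitioned delivery can look like, the hypothetical Python helper below builds the extraction query for one closed UTC day; the table name, column names, and SQL dialect are assumptions, and your actual export mechanism may differ.

```python
# Hypothetical sketch of a daily, append-only, date-partitioned export.
# Each run covers exactly one closed UTC day and never rewrites earlier partitions.
from datetime import date, timedelta

def daily_partition_query(partition_date: date) -> str:
    """Build the extraction query for one closed UTC day of raw events."""
    next_day = partition_date + timedelta(days=1)
    return f"""
    SELECT user_id, event_timestamp, event_name, revenue
    FROM raw_events  -- source-level table, minimal transformations
    WHERE event_timestamp >= TIMESTAMP('{partition_date.isoformat()}')
      AND event_timestamp <  TIMESTAMP('{next_day.isoformat()}')
    -- rows are appended to the {partition_date} partition and never mutated afterwards
    """

print(daily_partition_query(date(2024, 5, 1)))
```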
To ensure a smooth and successful integration, please consider the following common issues:
- Inconsistent Identifiers: Ensure that user identifiers are consistent across all data tables, and provide your ID mapping logic between tables (a validation sketch follows this list).
- Missing Events: Provide as many relevant events as possible to help us understand user behavior.
- Mutated Data: Avoid data mutation, as it increases the risk of training models on data that would not have been available at the time and will not be available at inference time.
- Full Refresh: Ensure that tables are incrementally updated and timestamp-partitioned to optimize query performance.
- Attribution Misalignment: Clearly communicate your attribution model to ensure we align our measurements with yours.
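For two of these issues, identifier consistency and append-only deliveries, a pre-delivery check might look like the sketch below; the in-memory rows and names are hypothetical stand-ins for your actual tables.

```python
# Hypothetical pre-delivery checks for identifier consistency and append-only updates.
def check_identifier_consistency(event_user_ids, payment_user_ids, id_mapping):
    """Every payment-side ID should map onto the event-side user ID space."""
    mapped = {id_mapping.get(pid) for pid in payment_user_ids}
    return mapped <= set(event_user_ids)

def check_append_only(previous_delivery, current_delivery):
    """Rows delivered earlier must reappear unchanged in later deliveries."""
    return all(row in current_delivery for row in previous_delivery)

# Toy usage with stand-in data.
event_ids = {"u_1", "u_2"}
payment_ids = {"p_9"}
id_mapping = {"p_9": "u_1"}
print(check_identifier_consistency(event_ids, payment_ids, id_mapping))  # True

delivery_day_1 = [("u_1", "2024-05-01T09:00:00Z", "app_open", None)]
delivery_day_2 = delivery_day_1 + [("u_1", "2024-05-02T10:00:00Z", "purchase", 9.99)]
print(check_append_only(delivery_day_1, delivery_day_2))  # True
```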
--
If you cannot meet these data requirements, we can connect you with a Churney Data Partner who can help you clean up and organize your data; end-to-end, this takes up to 3 weeks. Please reach out to your Churney point of contact for more information.
Your data warehouse has incredible value. Our causal AI helps unlock it.