What I’ve learned as a Marketing Data Engineer

Annie Beltrão
Nov 6, 2022 · 9 min read


A brief review of the main marketing APIs and of the marketing data features that influence standard data pipeline concepts.

Marketing data is a very important tool for making performance-guided decisions, building new marketing strategies and providing insights that improve results. However, the difficulties of building a consistent ETL pipeline are poorly documented, especially regarding the transactional nature of the data, which requires a custom pipeline and validation strategies to ensure quality, accuracy and confidence.

There are some concepts that are important to keep in mind when you are building a data pipeline, and they will be approached here in the following order: data modeling and cardinality, or aggregation levels; the transactional feature, or attribution window; taxonomy and how we can use it to split data into different segments; data validation and quality assurance; marketing APIs; data enrichment with Google Analytics; and, finally, the problems you will find along the way, such as manually filled columns and campaign tests that split rows.

Aggregation Levels

Marketing data is organized into layers, or levels of aggregation: account, campaign, ad group and ad, depending on the source. For example, Facebook reaches the ad level, while Google Ads reaches only the ad group level; some sources have additional levels, like LinkedIn, which has the campaign group. For reporting purposes, it’s helpful to think of each level as a collection of the following one, like a pyramid, as you can see in the image below.

Marketing Data Organization Levels

Once you understand the levels, it is important to understand marketing data modeling, or dimension x metrics modeling: the data is grouped based on the level of choice, which means the performance metrics are calculated based on the aggregation columns and the level (dimensions, also called breakdowns). This is related to data granularity, so it raises questions about extraction performance, data storage, API request limits and database modeling.
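As a minimal sketch of this dimension x metrics modeling (using pandas, with made-up column names that are not tied to any specific API), the snippet below shows how the same raw rows produce different metric values depending on the aggregation level you group by:

import pandas as pd

# Hypothetical ad-level rows; the column names are illustrative only.
rows = pd.DataFrame([
    {"campaign": "c1", "ad_group": "g1", "ad": "a1", "impressions": 100, "clicks": 8, "cost": 2.5},
    {"campaign": "c1", "ad_group": "g1", "ad": "a2", "impressions": 300, "clicks": 12, "cost": 4.0},
    {"campaign": "c1", "ad_group": "g2", "ad": "a3", "impressions": 250, "clicks": 5, "cost": 3.1},
])

# Metrics are additive, so each aggregation level is just a different group-by;
# derived metrics (CTR, CPC) must be recomputed after aggregating, never averaged.
for level in (["campaign"], ["campaign", "ad_group"], ["campaign", "ad_group", "ad"]):
    agg = rows.groupby(level, as_index=False)[["impressions", "clicks", "cost"]].sum()
    agg["ctr"] = agg["clicks"] / agg["impressions"]
    print(level, "\n", agg, "\n")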

Taxonomy

Taxonomy refers to hierarchical classifications and naming conventions aimed at the segmentation and objectivity of marketing assets. Taxonomies standardize data flows across attribution and integration platforms to ensure that your solution’s analyses, insights and recommendations map directly to targeted objectives. Creating a taxonomy helps you categorize your data so that you can use it in a functional way, and allows you to analyze the performance of the media dimensions that matter most to your business; in other words, a taxonomy means your data is categorized consistently across multiple sources and channels, ensuring data readability. Every taxonomy has a tiered structure. The company structure and your reporting needs determine which terms you choose for each level, such as funnel stage, campaign objective, keywords and campaign source. Not defining a standard taxonomy implies manual work and little traceability of performance indicators.

From the data engineer’s point of view, defining a well-structured, standard taxonomy helps you navigate through aggregation levels and segment data by source, medium, objective, funnel stage, location and keywords in an easy and accurate way. The main features to review are a good delimiter and clear patterns for where each term sits inside the taxonomy structure. That’s why the data engineer and the marketing team need to work together to define and maintain the best practices.

Marketing Taxonomy Patterns Example
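As an illustration only (the delimiter and the term order below are assumptions, not a universal standard), a campaign name that follows a taxonomy like source_medium_objective_funnelstage can be split back into analysis columns with a few lines of code:

# Hypothetical taxonomy: terms separated by "_" in a fixed order.
TAXONOMY_TERMS = ["source", "medium", "objective", "funnel_stage"]

def parse_campaign_name(name: str) -> dict:
    """Split a campaign name into taxonomy terms; missing terms become None."""
    parts = name.split("_")
    return {term: (parts[i] if i < len(parts) else None)
            for i, term in enumerate(TAXONOMY_TERMS)}

print(parse_campaign_name("facebook_paidsocial_conversion_bottom"))
# {'source': 'facebook', 'medium': 'paidsocial', 'objective': 'conversion', 'funnel_stage': 'bottom'}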

Marketing data is based on a lot of marketing business rules, which implies that some columns will be manually filled. So you need to think about how your data pipeline will deal with the same taxonomy written in upper or lower case, or with campaign tests that split rows and generate the same campaign under different ids. All of these problems lead back to taxonomy: you need to be very aligned about how to create a standard taxonomy.
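A small, hedged example of how a pipeline might defend itself against these problems, assuming a pandas DataFrame with campaign_id and campaign_name columns (the names and values are illustrative): normalize the text before using it as a key, and aggregate by the normalized name instead of the id so that campaign-test copies roll up together.

import pandas as pd

raw = pd.DataFrame([
    {"campaign_id": 111, "campaign_name": "Facebook_PaidSocial_Conversion", "cost": 10.0},
    {"campaign_id": 222, "campaign_name": "facebook_paidsocial_conversion", "cost": 7.5},  # test copy, new id
])

# Normalize manually filled text before using it as an aggregation key.
raw["campaign_key"] = raw["campaign_name"].str.strip().str.lower()

# Group by the normalized taxonomy key, not by campaign_id, so split test rows are consolidated.
consolidated = raw.groupby("campaign_key", as_index=False)["cost"].sum()
print(consolidated)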

Attribution Window

An attribution window, or conversion window, is a defined period of time in which a publisher can claim that a click or impression led to a conversion. For example, an advertiser sets an attribution window of seven days: if a user clicks on a publisher’s ad and converts to a lead within that time, that publisher receives the credit.

Attribution windows are an essential tool for helping advertisers and publishers understand when a conversion takes place, because there is often a gap between the exposure to an advertisement and the conversion. This matters to avoid users being mistaken for organic instead of paid, which would cause publishers to lose money. By default, clicks are tracked over a seven-day window and impressions for 24 hours. However, the attribution window is customizable up to thirty days for installs, which is more than enough time to capture the majority of interactions.

From the data engineer’s point of view, an attribution window implies very transactional data: during this defined period, the metric values change every day until they stabilize at the end of the attribution window. So it is important to have a pipeline based on extraction redundancy; instead of a D-1 extraction, you need an extraction that goes back the length of the attribution window. For example, Facebook has a default attribution window of 7 days, so the last 7 days need to be re-extracted every day to keep up with the changes in metric values.

Metrics values changing over time based on the attribution window length
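A minimal sketch of this extraction redundancy, assuming a 7-day attribution window and a hypothetical daily job; the actual API call and warehouse upsert are left as a placeholder, the point is that every run re-extracts (and overwrites) the whole window instead of only D-1:

from datetime import date, timedelta

ATTRIBUTION_WINDOW_DAYS = 7  # e.g. Facebook's default click attribution window

def dates_to_extract(today: date) -> list[date]:
    """All days still inside the attribution window, oldest first."""
    return [today - timedelta(days=d) for d in range(ATTRIBUTION_WINDOW_DAYS, 0, -1)]

def run_daily_extraction(today: date) -> None:
    for day in dates_to_extract(today):
        # Placeholder for your actual API extraction + upsert into the warehouse;
        # each day keeps being overwritten until its metrics stabilize.
        print(f"re-extracting {day.isoformat()}")

run_daily_extraction(date.today())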

Marketing APIs

Facebook

Facebook is the only API here that reaches the lowest aggregation level, the ad level. This implies a greater volume of data because of the maximum granularity, so requests have to be small or they may hit the limits on the amount of data per request. The Facebook Insights endpoint retrieves ads data grouped by the level and by breakdowns. The breakdowns are predefined: the data can be split by publisher platform (Instagram, Facebook, Messenger, Audience Network and Unknown), by geographical columns, or by age and gender columns, among others.

Facebook API fields are segmented following the documentation’s parameters. The retrieved data is grouped by the breakdown columns and the metrics are calculated based on those breakdowns; because the data is segmented, some fields are not valid when combined.

Requests are limited to 15 fields, passed as a list. One of the greatest problems with the Facebook API is its limits: there are limits on the number of fields, on the data volume per request and on the volume of requests per time window; once you hit the last one, the API remains blocked for one hour.
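A hedged sketch of an Insights call over plain HTTP; the Graph API version, field list and breakdown value below are illustrative choices, so check the current documentation before relying on them:

import requests

ACCESS_TOKEN = "<YOUR_TOKEN>"      # placeholder
ACCOUNT_ID = "act_<ACCOUNT_ID>"    # placeholder ad account id

# Keep the field list small: the API limits how many fields fit in one request.
params = {
    "access_token": ACCESS_TOKEN,
    "level": "ad",                                  # lowest aggregation level
    "fields": "campaign_name,adset_name,ad_name,impressions,clicks,spend",
    "breakdowns": "publisher_platform",             # facebook / instagram / messenger / audience_network
    "time_range": '{"since":"2022-11-01","until":"2022-11-01"}',
}

resp = requests.get(f"https://graph.facebook.com/v15.0/{ACCOUNT_ID}/insights", params=params)
resp.raise_for_status()
print(resp.json().get("data", []))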

Google Ads

The Google Ads API allows you to create requests using a query syntax. The API structure is very intuitive, and the data segmentation and field compatibility are abstracted by different tables. The API only reaches the ad group level, and you can build and validate the request query using the API web tool in the documentation. The fields are divided into three main categories: segments, which group fields usually used in the where clause, like dates and predefined date ranges, as well as other filters; attribute resource fields, which group aggregation columns; and finally metrics, calculated based on the attribute resource fields chosen. Like Facebook, Google Ads has field compatibility constraints, but they are easier to deal with because the incompatible fields are not selectable in the query builder.

About the metric fields: when they are typed as FLOAT in the API, they demand a transformation because they come as integers, so it is necessary to divide them by 1,000,000. Those fields usually carry a hint in the name and/or description, for example ‘costMicros’. When you are dealing with geolocation tables, the retrieved state and city names come as IDs and need to be converted to the correct nomenclature using a CSV.
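For illustration, a query at the ad group level plus the micros conversion described above, using the official google-ads Python client; this is a sketch under the assumption that credentials are already configured in a google-ads.yaml file and that the customer id is a placeholder:

from google.ads.googleads.client import GoogleAdsClient

QUERY = """
    SELECT
      campaign.name,
      ad_group.name,
      segments.date,
      metrics.clicks,
      metrics.cost_micros
    FROM ad_group
    WHERE segments.date DURING LAST_7_DAYS
"""

client = GoogleAdsClient.load_from_storage()          # reads google-ads.yaml
service = client.get_service("GoogleAdsService")

for batch in service.search_stream(customer_id="1234567890", query=QUERY):
    for row in batch.results:
        cost = row.metrics.cost_micros / 1_000_000    # micros -> currency units
        print(row.segments.date, row.campaign.name, row.ad_group.name, row.metrics.clicks, cost)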

LinkedIn

LinkedIn is the hardest API to deal with because the documentation is missing a lot of information. The Python code is based on a plain request where the parameters are passed in the URL: you define the time range, the aggregation level (pivot), the account and the metrics. The request URL receives two lists of fields, one called projections and another called fields; the projection controls how much of an entity’s data is displayed in the response. All APIs have a default set of field projections, but for LinkedIn this is customizable. Field projections are defined using the &fields= query parameter, which receives a comma-separated list of fields.

The aggregation columns are grouped in the pivotValue~ field; there you will find the campaign name, budget information, date range and more, so it is important to include this field in the projection list. Until the publication date of this article, the LinkedIn documentation says the API has pagination, but the pagination parameters do not work; even without any pagination solution, all the data is retrieved correctly.
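A rough sketch of such a request with the requests library; the endpoint version, parameter names, account URN and field list here are assumptions based on the description above, so treat them as placeholders and check the current LinkedIn Marketing API documentation:

import requests

ACCESS_TOKEN = "<YOUR_TOKEN>"  # placeholder

# Everything below is illustrative: adjust the endpoint, pivot and fields to your account and API version.
params = {
    "q": "analytics",
    "pivot": "CAMPAIGN",                        # aggregation level
    "dateRange.start.day": 1,
    "dateRange.start.month": 11,
    "dateRange.start.year": 2022,
    "accounts": "urn:li:sponsoredAccount:123",  # placeholder account URN
    "fields": "impressions,clicks,costInLocalCurrency,pivotValue",
}

resp = requests.get(
    "https://api.linkedin.com/v2/adAnalyticsV2",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    params=params,
)
resp.raise_for_status()
print(resp.json().get("elements", []))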

Data Enrichment with Google Analytics V3

The Google Analytics API is different from the others mentioned above because it aggregates new information onto the data retrieved from those APIs. The GA API does not follow a level; instead, the data is retrieved at the most granular level of each API and can be grouped by the columns (dimensions) specified in the request. So it is an API for data enrichment.

Google Analytics follows a structure of data grouped by dimensions, with metrics calculated based on those dimensions. It is important to know that the data in the API is segmented, so some dimension and metric fields are not valid when combined with one another. To avoid request errors, you can use the dev-tools website to set the request fields, following the limit of 7 dimensions and 10 metrics per request and considering the valid combinations of dimensions and metrics.
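A hedged sketch of a v3 Core Reporting request with the google-api-python-client, assuming a service-account key file and a placeholder view id, and staying under the 7-dimension / 10-metric limit:

from googleapiclient.discovery import build
from google.oauth2 import service_account

# Placeholder credentials path and view id; swap in your own.
credentials = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/analytics.readonly"],
)
analytics = build("analytics", "v3", credentials=credentials)

response = analytics.data().ga().get(
    ids="ga:12345678",                                      # placeholder view (profile) id
    start_date="2022-11-01",
    end_date="2022-11-01",
    metrics="ga:sessions,ga:transactions",                  # up to 10 metrics
    dimensions="ga:date,ga:source,ga:medium,ga:campaign",   # up to 7 dimensions
).execute()

print(response.get("rows", []))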

Sampling

Google Analytics works with a limit of 500,000 sessions per account; if this limit is exceeded, the data is released according to a sampling model, a subset of all the data used to recover meaningful information from the larger data set. Google Analytics may apply session sampling as a trade-off for time performance. So the API request has a samplingLevel parameter that needs to be defined considering the trade-off between speed and data recovered. This has a direct impact on the validation of the request, because the metric values will differ from the ones you see in the GA web manager, especially because the sampling model considers the period range when recalculating the metric values.

So, for all API requests, the validation needs to follow exactly the same parameters as the ones defined in the request: if the request is day by day, the validation needs to be done day by day in the GA web manager. If you choose a different time range, the GA web manager will show different metric values because of the sampling model. Another thing that affects the sampling model and the request validation process is dimension accumulation: when you request metrics calculated over many aggregation columns, or more granular data, you increase the complexity of the sampling model, resulting in data that diverges more from the GA web manager. So it is important to consider this in the validation process as well.
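The samplingLevel parameter mentioned above is set on the same request; a small sketch, reusing the analytics service built in the previous example (the view id and dates remain placeholders):

def get_report_high_precision(analytics, view_id: str) -> dict:
    """Request the report with samplingLevel favoring accuracy over speed."""
    return analytics.data().ga().get(
        ids=f"ga:{view_id}",
        start_date="2022-11-01",
        end_date="2022-11-01",
        metrics="ga:sessions",
        dimensions="ga:date,ga:source,ga:medium",
        samplingLevel="HIGHER_PRECISION",   # alternatives: DEFAULT, FASTER
    ).execute()

# These response flags say whether and how much sampling was applied,
# which is useful when validating against the GA web interface.
report = get_report_high_precision(analytics, "12345678")  # `analytics` built as in the previous sketch
print(report.get("containsSampledData"), report.get("sampleSize"), report.get("sampleSpace"))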

Pagination

Considering all the APIs mentioned above, Google Analytics is the only one with working pagination, and it is important to define the pagination parameters to access all the data in the API, or you will only get the first page of default max results. The pagination works by defining a maximum number of rows per page, the max_results parameter, and a start index, the start_index parameter. For each request, the API is iterated index by index, recovering all the rows of each page; when a page comes back with fewer rows than the defined max results, it means the pagination is over.
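A sketch of that pagination loop, again reusing the analytics service from the earlier sketches; the stop condition follows the description above, ending when a page returns fewer rows than max_results:

def fetch_all_rows(analytics, view_id: str, max_results: int = 10_000) -> list:
    """Page through the v3 Core Reporting API using max_results / start_index."""
    all_rows, start_index = [], 1
    while True:
        response = analytics.data().ga().get(
            ids=f"ga:{view_id}",
            start_date="2022-11-01",
            end_date="2022-11-01",
            metrics="ga:sessions",
            dimensions="ga:date,ga:source,ga:medium",
            max_results=max_results,
            start_index=start_index,
        ).execute()
        rows = response.get("rows", [])
        all_rows.extend(rows)
        if len(rows) < max_results:     # short page -> no more data
            break
        start_index += max_results
    return all_rows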
