Version: Next

DynamoDB

Testing

Important Capabilities

Capability | Status | Notes
Detect Deleted Entities | | Optionally enabled via stateful_ingestion.remove_stale_metadata
Platform Instance | | By default, platform_instance will use the AWS account ID

This plugin extracts the following:

AWS DynamoDB table names with their regions, plus a schema of attribute names and types inferred by scanning each table
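
As a rough illustration of what inferring a schema from scanned items can look like, the sketch below collects the DynamoDB type codes seen for each attribute. It is not the plugin's implementation; the function name infer_schema and the sample items are hypothetical, and the items are assumed to be in DynamoDB's typed JSON format.

```python
from collections import defaultdict


def infer_schema(items):
    """Map each attribute name to the sorted DynamoDB type codes seen for it.

    Hypothetical helper for illustration only; not the plugin's actual code.
    """
    schema = defaultdict(set)
    for item in items:
        for name, typed_value in item.items():
            # Each value is a one-entry dict such as {"S": "text"} or {"N": "42"}.
            for type_code in typed_value:
                schema[name].add(type_code)
    return {name: sorted(types) for name, types in schema.items()}


# Made-up sample items in DynamoDB's typed JSON format.
sample_items = [
    {"Id": {"S": "Amazon DynamoDB#DynamoDB Thread 1"}, "Views": {"N": "3"}},
    {"Id": {"S": "Amazon DynamoDB#DynamoDB Thread 2"}, "Tags": {"SS": ["a", "b"]}},
]

print(infer_schema(sample_items))
```

Because DynamoDB tables have no fixed schema beyond the primary key, an attribute can appear in only some items, which is why the sketch merges observations across all sampled items.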

Prerequisites

To run this source, you need an access key ID and secret access key with DynamoDB read access. You can create the policies below and attach them to your account, or ask your account admin to attach them for you.

For access key permissions, create a policy with the permissions below and attach it to your account. For more details, see Managing access keys for IAM users.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "iam:ListAccessKeys",
        "iam:CreateAccessKey",
        "iam:UpdateAccessKey",
        "iam:DeleteAccessKey"
      ],
      "Resource": "arn:aws:iam::${aws_account_id}:user/${aws:username}"
    }
  ]
}

For DynamoDB read access, attach the AWS managed policy AmazonDynamoDBReadOnlyAccess to your account. For more details, see Attaching a policy to an IAM user group.

CLI based Ingestion

Install the Plugin

pip install 'acryl-datahub[dynamodb]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: dynamodb
  config:
    platform_instance: "AWS_ACCOUNT_ID"
    aws_access_key_id: "${AWS_ACCESS_KEY_ID}"
    aws_secret_access_key: "${AWS_SECRET_ACCESS_KEY}"
    # Optionally provide a list of primary keys, in DynamoDB format, for a table.
    # The items with those primary keys are included when the table is scanned.
    # For each table we can retrieve up to 16 MB of data, which can contain as many as 100 items.
    # The primary key list for a table may not exceed 100 entries.
    # Total items retrieved per table:
    # 1. If include_table_item is not specified: up to 100 items.
    # 2. If include_table_item is specified: up to 100 items plus the user-specified items,
    #    for a total of no more than 200 items.
    # include_table_item:
    #   table_name:
    #     [
    #       {
    #         "partition_key_name": { "attribute_type": "attribute_value" },
    #         "sort_key_name": { "attribute_type": "attribute_value" },
    #       },
    #     ]

sink:
  # sink configs
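
The item limits described in the recipe comments can be checked before running ingestion. The sketch below is only an illustration of that arithmetic; check_include_table_item and max_items_sampled are hypothetical helpers, not part of the plugin, and the 100-key limit is taken from the comments above.

```python
MAX_KEYS_PER_TABLE = 100  # limit stated in the recipe comments
MAX_SCANNED_ITEMS = 100   # items retrieved per table without include_table_item


def check_include_table_item(include_table_item):
    """Raise ValueError if any table lists more than 100 primary keys.

    Hypothetical pre-flight check; the plugin's own validation may differ.
    """
    for table_name, keys in include_table_item.items():
        if len(keys) > MAX_KEYS_PER_TABLE:
            raise ValueError(
                f"{table_name}: {len(keys)} primary keys given, "
                f"at most {MAX_KEYS_PER_TABLE} allowed"
            )


def max_items_sampled(user_key_count):
    """Upper bound on items retrieved for one table."""
    return MAX_SCANNED_ITEMS + min(user_key_count, MAX_KEYS_PER_TABLE)


print(max_items_sampled(0))    # no include_table_item entry for the table
print(max_items_sampled(100))  # the largest allowed key list
```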

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

Field | Type | Description | Default
aws_access_key_id | string | AWS Access Key ID. |
aws_secret_access_key | string(password) | AWS Secret Key. |
include_table_item | map(str,array) | |
platform_instance | string | The instance of the platform that all assets produced by this recipe belong to |
env | string | The environment that all assets produced by this connector belong to | PROD
table_pattern | AllowDenyPattern | Regex patterns for tables to filter in ingestion. | {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
table_pattern.allow | array(string) | |
table_pattern.deny | array(string) | |
table_pattern.ignoreCase | boolean | Whether to ignore case sensitivity during pattern matching. | True
stateful_ingestion | StatefulStaleMetadataRemovalConfig | Base specialized config for Stateful Ingestion with stale metadata removal capability. |
stateful_ingestion.enabled | boolean | The type of the ingestion state provider registered with datahub. | False
stateful_ingestion.remove_stale_metadata | boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. | True
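
To make the table_pattern fields more concrete, here is a minimal sketch of allow/deny filtering with regular expressions. It assumes that patterns are matched from the start of the table name and that a deny match takes precedence over an allow match; both are assumptions for illustration, and is_allowed is a hypothetical function rather than DataHub's AllowDenyPattern class.

```python
import re


def is_allowed(name, allow=(".*",), deny=(), ignore_case=True):
    """Return True if name matches an allow pattern and no deny pattern.

    Assumption: patterns are anchored at the start of the name (re.match)
    and deny wins over allow. The real AllowDenyPattern may differ.
    """
    flags = re.IGNORECASE if ignore_case else 0
    if any(re.match(pattern, name, flags) for pattern in deny):
        return False
    return any(re.match(pattern, name, flags) for pattern in allow)


# With the defaults shown in the table, every table name is allowed.
print(is_allowed("Reply"))
# A deny pattern excludes matching tables, ignoring case by default.
print(is_allowed("tmp_orders", deny=["TMP_.*"]))
```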

Limitations

For each region, the list-tables operation returns at most 100 tables. Pagination for listing tables is not yet implemented, so tables beyond the first 100 in a region are not picked up.
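
For reference, paginating the list-tables call could look roughly like the sketch below. It follows the boto3 DynamoDB client's list_tables contract (the Limit and ExclusiveStartTableName parameters and the LastEvaluatedTableName response field); list_all_tables and FakeDynamoDBClient are hypothetical names, and the fake client exists only so the example runs without AWS credentials. This is not how the plugin currently behaves.

```python
def list_all_tables(client, page_size=100):
    """Collect every table name by following LastEvaluatedTableName.

    `client` is expected to behave like boto3.client("dynamodb").
    """
    table_names = []
    kwargs = {"Limit": page_size}
    while True:
        response = client.list_tables(**kwargs)
        table_names.extend(response.get("TableNames", []))
        last_evaluated = response.get("LastEvaluatedTableName")
        if not last_evaluated:
            return table_names
        # Resume the listing right after the last table of the previous page.
        kwargs["ExclusiveStartTableName"] = last_evaluated


class FakeDynamoDBClient:
    """Hypothetical in-memory stand-in that mimics list_tables paging."""

    def __init__(self, names):
        self._names = sorted(names)

    def list_tables(self, Limit=100, ExclusiveStartTableName=None):
        start = 0
        if ExclusiveStartTableName is not None:
            start = self._names.index(ExclusiveStartTableName) + 1
        page = self._names[start:start + Limit]
        response = {"TableNames": page}
        if start + Limit < len(self._names):
            response["LastEvaluatedTableName"] = page[-1]
        return response


fake_client = FakeDynamoDBClient([f"table_{i:03d}" for i in range(250)])
all_tables = list_all_tables(fake_client)
print(len(all_tables))
```

With a real client, the same function would be called once per region, for example with a client created for each region of interest.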

Advanced Configurations

Using include_table_item config

If certain items contain the most representative fields of a table, you can use the include_table_item option to provide a list of primary keys for that table in DynamoDB format. The items with those primary keys are included when the table is scanned.

Take the Example tables and data from the AWS DynamoDB Developer Guide: if you have a table Reply with the composite primary key Id and ReplyDateTime, you can use include_table_item to include two items as follows:

Example:

# put the table name and composite key in DynamoDB format
include_table_item:
  Reply:
    [
      {
        "ReplyDateTime": { "S": "2015-09-22T19:58:22.947Z" },
        "Id": { "S": "Amazon DynamoDB#DynamoDB Thread 1" },
      },
      {
        "ReplyDateTime": { "S": "2015-10-05T19:58:22.947Z" },
        "Id": { "S": "Amazon DynamoDB#DynamoDB Thread 2" },
      },
    ]
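
If the primary keys are available as plain values, they first need to be wrapped in DynamoDB's typed format as in the example above. The sketch below shows one way to do that for string and number keys; to_dynamodb_key is a hypothetical helper, and binary keys (type B) are deliberately left out.

```python
def to_dynamodb_key(key):
    """Wrap plain key values in DynamoDB's typed format.

    Hypothetical helper: handles only string (S) and number (N) key
    attributes; anything else raises TypeError.
    """
    typed = {}
    for name, value in key.items():
        if isinstance(value, str):
            typed[name] = {"S": value}
        elif isinstance(value, bool):
            # bool is a subclass of int, but is not a valid key type.
            raise TypeError(f"{name}: bool is not a valid key type")
        elif isinstance(value, (int, float)):
            # DynamoDB transmits numbers as strings.
            typed[name] = {"N": str(value)}
        else:
            raise TypeError(f"{name}: unsupported key type {type(value).__name__}")
    return typed


reply_key = to_dynamodb_key(
    {
        "ReplyDateTime": "2015-09-22T19:58:22.947Z",
        "Id": "Amazon DynamoDB#DynamoDB Thread 1",
    }
)
print({"Reply": [reply_key]})
```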

Code Coordinates

  • Class Name: datahub.ingestion.source.dynamodb.dynamodb.DynamoDBSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for DynamoDB, feel free to ping us on our Slack.