AWS S3 Storage

8 min

amazon s3 or amazon simple storage service is a service offered by amazon web services https //en wikipedia org/wiki/amazon web services (aws) that provides object storage https //en wikipedia org/wiki/object storage through a web service https //en wikipedia org/wiki/web service interface amazon s3 uses the same scalable storage infrastructure that amazon com https //en wikipedia org/wiki/amazon (company) uses to run its e commerce network amazon s3 can store any type of object, which allows uses like storage for internet applications, backups, disaster recovery, data archives, data lakes https //en wikipedia org/wiki/data lake for analytics, and hybrid cloud storage https //en wikipedia org/wiki/cloud computing#hybrid cloud our aws s3 storage datalakehouse io integration allows for s3 as a source and a target connection replicates aws s3 storage data to your cloud data warehouse target synchronizes to your target destination at a scheduled frequency replicate many connections directly to your aws s3 storage bucket it allows you to replicate/synchronize your s3 data files, including capturing snapshots of data at any point int time, and keep it up to date with little to no configuration efforts you don’t even need to prepare the target schema — datalakehouse io will automatically handle all the heavy lifting for you all you need is to specify the connection to your s3, point to your target system, or use a datalakehouse io managed data warehouse and datalakehouse io does the rest our support team can even help you set it up for you during a short technical on boarding session setup instructions datalakehouse io securely connects to your aws s3 storage using the form in the datalakehouse io portal please complete the following basic steps enter a name or alias for this connection, in the 'name/alias' field, that is unique from other connectors enter a 'target schema prefix', which will be the prefix for the schema at the target you will sync your data files into when used as a source only enter a 'bucket' name, where your files are stored typically starts with s3 // or https //, so enter just the name without the prefix for every file in the bucket a separate table will be created and loaded into the what is a target connection? docid 3wx 24ml25noxc1atdbs4 connected via a sync bridge select your 'region' enter your 'access key', credentials to access the bucket enter your 'secret key', credentials to access the bucket enter any other optional details in the available fields (see the setup video if you need help or contact support) folder path, is a prefix path on the root bucket from where desired files will be retrieved for json/gz files, that are stored within nested folders, each file(s) in the subfolder(s) will be inserted into the same target connection table as the parent folder the presumption is that the file structure is the same across all the files within the nested folders folder paths should always end with a forward slash (/) jinja usage basic jinja can be used for timestamp logic in the folder path to determine what the dynamic structure of the s3 bucket might be, for example as a time based solution, if looking in a folder prefix/bucket as such, 2025/11/10/myfiles/otherfiles/ , and the date structure is correctly used over time, then the following jinja expression can be used in the folder path field, {{ today() | format date('yyyy') }}/{{ today() | format date('mm') }}/ , to ensure that dlh io is dynamically retrieving from that folder each day file pattern, is a regular expression (regex) used to isolated only certain files to be retrieved the length of the regex is limited (100 characters) folder pattern, is a regular expression (regex) and jinja field which can only be used in certain circumstances, usually by dlh io support team when working with customers on special data integration user cases file type, allows for a pre determined type of file extension to be retrieved json files stored in gz compressed files will get ingested in the same manner as json files not stored in a gz file click the save & test button once your credentials are accepted you should be able to see a successful connection message appear how to setup gzip and compressed file handling in the option for compression there are several options gzip can contain multiple files it doesn't matter if the file types are all of a kind or different formats for loading as long as they are compatible structures to json, csv, etc and this depends on the selected file type and will parse only the file type selected (json, csv, etc ) gz compressed version of json (a single json file) if file type is json then it will be an unzipped json file only zip similar to how gzip performs as described above control each column data type sql transformations docid 5zjgrvbhtywqw8 olioh0 allow logic to be executed against a target connection based on a scheduled frequency or triggered event of new data on tables updated via datalakehouse io (dlh io) this especially helps when you want to control the data type set in your target connection security and other considerations s3 can be used as a source or target destination however, this should be used with great consideration to the impact of other systems that may conflict restrictions prevent one bucket from syncing to another bucket as this itnegration is not meant for big data transfer concepts and will fail if attempting to do so permissions attempt to use the get and put privileges for connecting and testing the connection to the s3 storage bucket in many cases one of the first test is to verify if your access allows for listing the bucket and or all files in the bucket this uses the listbucket privilege, s3\ listbucket failing to have one of these privileges set as part of your policy may cause your integration to fail an error will be present in the logs if that is the case if you seek some special permissions due to security constraints or otherwise please contact customer support docid\ pbtuxndqrdogoroejbgsv the recommendation is that your role or policy appears as follows if having full control over the s3 or s3 compatible bucket { "version" "2012 10 17", "statement" \[ { "effect" "allow", "action" \[ "s3\ putobject", "s3\ getobject", "s3\ deleteobject", "s3\ putobjectacl", "s3\ listbucket", "s3\ getbucketlocation" ], "resource" \[ "arn\ aws\ s3 \<bucket name> ", "arn\ aws\ s3 \<bucket name> / " ] } ] } issue handling if any issues occur with the authorization simply return to the sources page in datalakehouse io, edit the source details and click the 'save & test' button to confirm connectivity if any issues persist please contact our support team via the datalakehouse io support portal https //datalakehouse zendesk com

Files & Object Storage

Azure Blob Storage