
Best way to analyse & query S3 logs

Hi there,

I've set up S3 as a destination. I tried using AWS Athena and AWS Glue to create a way to analyse or display the raw data better.

It doesn't seem to work though.

Does anyone have a nice clean way of doing this? I think data warehouses are a bit too costly and difficult to use.

Open to suggestions

Thanks!

Stefan

10 replies
  • Hi Stefan,

    Are you having difficulty building tables using Athena, or are you running into difficulties querying the data?

    Best,

    Xavier

  • Xavier Agostini Using AWS Glue the tables were created, but when I try to run queries I keep getting errors parsing the data. I can share the error message if that helps.

  • Sure Stefan, I can take a look at your errors. I'm not familiar with AWS Glue, but we use AWS Athena internally and can share how we create tables if you would like.

  • Xavier Agostini Actually that would be ideal, if you could just tell me how to set those tables up. Glue was meant to do it automagically by analysing the logs!

    Thanks!

  • CREATE EXTERNAL TABLE IF NOT EXISTS <DATABASE_NAME>.<YOUR_TABLE_NAME> (
      `anonymousId` string,
      `context` map<string,string>,
      `integrations` map<string,string>,
      `messageId` string,
      `receivedAt` string,
      `timestamp` string,
      `sentAt` string,
      `type` string,
      `userId` string,
      `traits` map<string,string>,
      `event` string,
      `properties` map<string,string>,
      `name` string,
      `groupId` string,
      `previousId` string
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    WITH SERDEPROPERTIES (
      'serialization.format' = '1', 'ignore.malformed.json' = 'true'
    ) LOCATION 's3://<BUCKET_PATH>/<SOURCE_ID>/<DATE>';

    You will need to specify a few values: the database name, your desired table name, the S3 bucket path, the Segment source ID the S3 destination is using, and the date (expressed as a UNIX timestamp). This SQL statement will create a table for one day's worth of your Segment S3 logs. After you have created the table, you can start querying your data!
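
    For example, once the table exists, a query like this would count track events by name (just a sketch, using the same database and table name placeholders as above):

      SELECT event,
             count(*) AS event_count
      FROM <DATABASE_NAME>.<YOUR_TABLE_NAME>
      WHERE type = 'track'
      GROUP BY event
      ORDER BY event_count DESC;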

    Best,

    Xavier

  • Xavier Agostini Awesome! Thanks, although it seems that `event`, `name`, `groupId`, and `previousId` are all empty. Any ideas?

    Also, do I need to set up a table per day to analyse?

    Thanks!

  • Hi Stefan Pretty,

    Those values will be empty depending on the API call. For example, `event` is only populated for track calls. If you want to analyze multiple days you will need to create a partitioned table, with each partition being a different day's logs.

    > http://docs.aws.amazon.com/athena/latest/ug/partitions.html
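
    For example, a partitioned version of the table might look something like this (just a sketch; it assumes the same column list and S3 layout as the table above, with a made-up partition column named dt holding the day folder):

      CREATE EXTERNAL TABLE IF NOT EXISTS <DATABASE_NAME>.<YOUR_PARTITIONED_TABLE> (
        `anonymousId` string,
        `context` map<string,string>,
        `integrations` map<string,string>,
        `messageId` string,
        `receivedAt` string,
        `timestamp` string,
        `sentAt` string,
        `type` string,
        `userId` string,
        `traits` map<string,string>,
        `event` string,
        `properties` map<string,string>,
        `name` string,
        `groupId` string,
        `previousId` string
      )
      PARTITIONED BY (dt string)
      ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
      WITH SERDEPROPERTIES (
        'serialization.format' = '1', 'ignore.malformed.json' = 'true'
      ) LOCATION 's3://<BUCKET_PATH>/<SOURCE_ID>/';

      -- Register each day's folder as its own partition before querying it
      ALTER TABLE <DATABASE_NAME>.<YOUR_PARTITIONED_TABLE>
        ADD IF NOT EXISTS PARTITION (dt = '<DATE>')
        LOCATION 's3://<BUCKET_PATH>/<SOURCE_ID>/<DATE>';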

  • Xavier Agostini Ok thank you!

  • Xavier Agostini 

    Thanks for sharing your external table definition!

    I am making use of this code for reading data from S3 with Redshift Spectrum (which works very similarly to Athena). I was wondering if you could elaborate a bit on the steps that come after this external table definition, namely how you handle the nested columns, i.e. the "map" type columns (context, integrations, traits, properties). Do you have more logic you can share that deserializes these fields and then splits the table by "event" (similar to how tables are created when integrating Segment directly with Redshift)?

    Your help here is much appreciated! Thanks!

  • Hi Noha Alon,

    To handle nested columns in Athena you can use the json_extract_scalar function. For example, to extract the name of the library an event was sent with from the event's context object, you could do:

      json_extract_scalar(context['library'], '$.name')
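
    In a full query against the table definition shared earlier, that might look something like this (just a sketch):

      SELECT json_extract_scalar(context['library'], '$.name') AS library_name,
             count(*) AS event_count
      FROM <DATABASE_NAME>.<YOUR_TABLE_NAME>
      GROUP BY 1
      ORDER BY event_count DESC;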

    I'm not sure how you would split the data into separate event tables using Spectrum, unfortunately.


    Best,

    Xavier
