
Best way to analyse & query S3 logs

Hi there,

I've set up S3 as a destination. I tried using AWS Athena and AWS Glue to build a way to analyse and display the raw data better.

It doesn't seem to work though.

Does anyone have a nice clean way of doing this? I think data warehouses are a bit too costly and difficult to use.

Open to suggestions

Thanks!

Stefan

17 replies
  • Hi Stefan,

    Are you having difficulty building tables using Athena, or are you running into difficulties querying the data?

    Best,

    Xavier

  • Xavier Agostini Using AWS Glue the tables were created, but when I try to run queries I keep getting errors from parsing the data. I can share the error message?

  • Sure Stefan, I can take a look at your errors. I'm not familiar with AWS Glue, but we use AWS Athena internally and I can share how we create tables if you would like.

  • Xavier Agostini Actually that would be ideal, if you could just tell me how to set those tables up. Glue was meant to automagically do it by analysing the logs!

    Thanks!

  • CREATE EXTERNAL TABLE IF NOT EXISTS <DATABASE_NAME>.<YOUR_TABLE_NAME> (
      `anonymousId` string,
      `context` map<string,string>,
      `integrations` map<string,string>,
      `messageId` string,
      `receivedAt` string,
      `timestamp` string,
      `sentAt` string,
      `type` string,
      `userId` string,
      `traits` map<string,string>,
      `event` string,
      `properties` map<string,string>,
      `name` string,
      `groupId` string,
      `previousId` string
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    WITH SERDEPROPERTIES (
      'serialization.format' = '1', 'ignore.malformed.json' = 'true'
    ) LOCATION 's3://<BUCKET_PATH>/<SOURCE_ID>/<DATE>';

    You will need to specify a few values: the database name, the desired table name, the S3 bucket path, the Segment source ID the S3 destination is using, and the date (expressed as a UNIX timestamp). This SQL statement will create a table for one day's worth of your Segment S3 logs. After you have created the table, you can start querying your data!
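
    For example, once the table exists, a quick sanity check could look something like the query below (the database and table names are the same placeholders as above):

      SELECT type, event, count(*) AS events
      FROM <DATABASE_NAME>.<YOUR_TABLE_NAME>
      GROUP BY type, event
      ORDER BY events DESC
      LIMIT 20;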

    Best,

    Xavier

  • Xavier Agostini Awesome! Thanks, although it seems that `event`, `name`, `groupId`, `previousId` are all empty, any ideas?

    Also do I need to set a table up per day to analyse?

    Thanks!

  • Hi Stefan Pretty,

    Those values will be empty depending on the API call. For example, event is only populated for track calls. If you want to analyze multiple days you will need to create a partitioned table, with each partition being a different day's logs.

    > http://docs.aws.amazon.com/athena/latest/ug/partitions.html
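
    As a rough sketch (same placeholders as before, and `received_date` is just an example partition column name), the partitioned version adds a PARTITIONED BY clause and then registers one partition per day's folder:

      CREATE EXTERNAL TABLE IF NOT EXISTS <DATABASE_NAME>.<YOUR_TABLE_NAME> (
        -- same columns as in the table above, trimmed here for brevity
        `anonymousId` string,
        `type` string,
        `event` string,
        `properties` map<string,string>
      )
      PARTITIONED BY (`received_date` string)
      ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
      WITH SERDEPROPERTIES (
        'serialization.format' = '1', 'ignore.malformed.json' = 'true'
      ) LOCATION 's3://<BUCKET_PATH>/<SOURCE_ID>/';

      -- run once per day, pointing each partition at that day's folder
      ALTER TABLE <DATABASE_NAME>.<YOUR_TABLE_NAME>
      ADD IF NOT EXISTS PARTITION (received_date = '<DATE>')
      LOCATION 's3://<BUCKET_PATH>/<SOURCE_ID>/<DATE>/';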

  • Xavier Agostini Ok thank you!

  • Xavier Agostini 

    Thanks for sharing your external table definition!

    I am making use of this code for reading data from S3 with Redshift Spectrum (which works very similarly to Athena). I was wondering if you could elaborate a bit on the steps that come after this external table definition - namely, how do you handle the nested "map" columns (context, integrations, traits, properties)? Do you have more logic you can share that deserializes these fields and then splits the table by "event" (similar to how tables are created when integrating Segment directly with Redshift)?

    Your help here is much appreciated! Thanks!

  • Hi Noha Alon,

    To handle nested columns in Athena you can use the json_extract_scalar method. For example, to extract the name of the library an event was sent with from the event's context object, you could do:

      json_extract_scalar(context['library'], '$.name')
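
    In a full query that might look something like this (the table name is a placeholder, and it assumes context['library'] holds the usual Segment library object with name and version fields):

      SELECT
        json_extract_scalar(context['library'], '$.name')    AS library_name,
        json_extract_scalar(context['library'], '$.version') AS library_version,
        count(*) AS events
      FROM <DATABASE_NAME>.<YOUR_TABLE_NAME>
      GROUP BY 1, 2
      ORDER BY events DESC;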

    I'm not sure how you would split the data into separate event tables using Spectrum, unfortunately.


    Best,

    Xavier

  • Xavier Agostini - I'm trying to analyze Segment logs from an S3 bucket, but it looks like the .gz files Segment creates in the S3 bucket are not in proper JSON array format. They seem to be missing the outer array brackets ( [ ] ) and also the comma (,) as a record separator between two JSON records. Is there some configuration which would render these .gz Segment logs in proper JSON array format? Let me know if you need any example file data.

  • Stefan Pretty Noha Alon - Did you come across the above issue (the .gz Segment log files not being in proper JSON array format) while analyzing the Segment logs from the S3 bucket? I am also making use of Redshift Spectrum to read data from S3, but the files need to be in correct JSON format for the JSON SerDe to read them. Can you please provide any input? For example, below is sample data from one of the .gz files.

    {"anonymousId":"abc","event":"login","timestamp":"2018-09-27T17:35:55.835Z","version":2}
    {"anonymousId":"def","timestamp":"2018-09-27T17:35:55.835Z","Id":"faf91826-ebc9-4242-996f-d52969bec2d5","version":2}
    {"anonymousId":"xyz","event":"LandingPage","timestamp":"2019-01-21T05:14:22.998Z","Id":"88016b33-72d7-458e-8de8-f76241f4b681","version":2} 
  • Hi Vikram,

    Our S3 logs are stored in JSON line format (http://jsonlines.org/). Here is an example Athena query you can run to analyze the data:

    CREATE EXTERNAL TABLE IF NOT EXISTS <DB_NAME>.<YOUR_TABLE_NAME> (
      `anonymousId` string,
      `context` map<string,string>,
      `integrations` map<string,string>,
      `messageId` string,
      `receivedAt` string,
      `timestamp` string,
      `sentAt` string,
      `type` string,
      `userId` string,
      `traits` map<string,string>,
      `event` string,
      `properties` map<string,string>,
      `name` string,
      `groupId` string,
      `previousId` string
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    WITH SERDEPROPERTIES (
      'serialization.format' = '1', 'ignore.malformed.json' = 'true'
    ) LOCATION 's3://<BUCKET_NAME>/<SOURCE_ID>/<DATE>';

    (Date is a unix timestamp)

    Best,

    Xavier

  • Xavier Agostini Thanks for the reply! So do you manually create a table for each date (based on the Unix timestamp), or does Athena do that?

    In our use case, we would like to build separate tables for different events. Can we read the Segment S3 logs (stored in JSON line format in .gz files) and move the JSON objects associated with specific events to a different location on S3? Then we could create a table for each specific event by providing the S3 file path for that event:

    LOCATION 's3://<BUCKET_NAME>/<SOURCE_ID>/<EVENT>'
  • Vikram with Athena you can create a partitioned table to get data over multiple dates. The S3 files contain all the events for a specific date, so you would need to parse them and rebuild the files into a per-event folder scheme.
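
    One possible shortcut, if your Athena setup supports CREATE TABLE AS SELECT, is to let Athena do the rewriting: a CTAS statement can read the JSON-line table and write out Parquet files partitioned by event (the names and S3 path below are placeholders, and the partition column has to come last in the SELECT):

      CREATE TABLE <DATABASE_NAME>.<EVENTS_TABLE>
      WITH (
        format = 'PARQUET',
        external_location = 's3://<BUCKET_NAME>/<OUTPUT_PREFIX>/',
        partitioned_by = ARRAY['event']
      ) AS
      SELECT
        anonymousId,
        userId,
        receivedAt,
        properties,
        event  -- partition column, must be last
      FROM <DATABASE_NAME>.<YOUR_TABLE_NAME>
      WHERE type = 'track';

    This also converts the data to Parquet, which generally reduces the amount of data scanned (and therefore the cost) per query, since Athena bills by bytes scanned. One caveat: a single CTAS statement can only write a limited number of partitions (100 at the time of writing), so a source with very many distinct events may still need the files rewritten outside Athena.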

  • Xavier Agostini I thought of checking how many times a day you run queries against the Athena table. Did you try converting the Segment logs into Parquet format to check the performance and pricing difference (between querying JSON .gz files and Parquet files)? I would like to know about any pricing benefits when querying Segment logs through Parquet files (through the columnar storage and compression Parquet can give).

  • Stefan Pretty Noha Alon - Has either of you tried converting the Segment S3 logs (which are stored in JSON line format) into Parquet format to evaluate the pricing benefits when querying Segment logs through Parquet files (through the columnar storage and compression Parquet can give)?
