How to Efficiently Transform a CSV File and Upload It in Compressed Form to AWS S3 (Python, Boto3)
If you have been working in the data engineering space, chances are that you've been involved in processing CSV files. Even though it's not the most efficient format for analytics, CSV still enjoys quite a significant footprint in the current data landscape. It's a widely supported format and is generally encountered when a source system (e.g. a database) provides data in the form of files which are meant to be ingested into a data lake (and subsequently into specialized stores, e.g. a data warehouse, to serve specific use-cases). If you are an AWS shop and have a requirement to process CSV files and upload them to S3, this article covers one of the approaches of doing so using Python and the AWS SDK (boto3). Though one can approach this problem in many ways, a few considerations are worth highlighting to make it a rather feasible solution:
- Compression prior to ingestion — Networks are more often than not the major bottleneck in any system's architecture. The CSV format, on the other hand, isn't compact, so files are usually bulky when uncompressed. Ingesting a bulky CSV file to AWS S3 can therefore be a rather costly operation, and it is always recommended to compress data prior to ingesting it. There can be multiple considerations in choosing the right compression format depending upon how the downstream processing will be done, e.g. if you are using Big Data tools like Spark, it's recommended to use a splittable compression format. For the sake of simplicity, I am using gzip as the compression format.
- Avoiding disk I/O in ETL — Wherever possible, it is recommended to avoid disk I/O in your ETL processes. Disk I/O is a considerably expensive operation, so anything that you can do in-memory ultimately helps the overall optimization of your approach. (A quick sketch illustrating both points follows this list.)
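As a quick illustration of both points, here is a minimal sketch (assuming a hypothetical local file named sensor_data.csv) that gzips the file's contents entirely in memory and compares sizes:

```python
import gzip
from pathlib import Path

raw_bytes = Path("sensor_data.csv").read_bytes()      # hypothetical local CSV
gz_bytes = gzip.compress(raw_bytes, compresslevel=6)  # compressed entirely in memory
print(f"raw: {len(raw_bytes):,} bytes -> gzipped: {len(gz_bytes):,} bytes")
```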
With these fundamental considerations in place, let's look at an example of how you can transform a CSV file, compress it (on-the-fly, in memory) and upload it to S3 via Python's native libraries and the AWS SDK (boto3).
Requirements:
- AWS Account
- IAM user
- S3 Bucket
- Python
- Boto3
Let's assume that you have a simple CSV file that looks like this:
```
sensor_code,parameter_name,start_timestamp,finish_timestamp,sensor_value
DEX123,OCO_1,30/09/2021 5:00:00 PM,1/10/2021 1:00:00 AM,0.3008914
DEX123,OCO_1,30/09/2021 6:00:00 PM,1/10/2021 2:00:00 AM,0.2821953
DEX123,OCO_1,30/09/2021 7:00:00 PM,1/10/2021 3:00:00 AM,0.2513988
DEX123,OCO_1,30/09/2021 8:00:00 PM,1/10/2021 4:00:00 AM,0.2153951
DEX129,OCO_1,30/09/2021 9:00:00 PM,1/10/2021 5:00:00 AM,0.1723991
DEX129,OCO_2,30/09/2021 10:00:00 PM,1/10/2021 6:00:00 AM,0.1423465
DEX129,OCO_2,30/09/2021 11:00:00 PM,1/10/2021 7:00:00 AM,0.1424952
DEX129,OCO_2,1/10/2021 12:00:00 AM,1/10/2021 8:00:00 AM,0.1455519
DEX129,OCO_2,1/10/2021 1:00:00 AM,1/10/2021 9:00:00 AM,0.1682339
```
Conceptually, this can be thought of as time-series data of sensor metrics. You may have noticed that the timestamp format is not the "standard" format, i.e. yyyy-mm-dd HH:MM:SS. If you are loading data into data warehouses, they may have some constraints in parsing non-standard time formats. Redshift, for instance, can recognize many timestamp formats, but not all, and the one in the example above is a format that Redshift can't recognize when loading data into it (e.g. via the COPY command from S3). So if the goal is to load this data into Redshift, this substantiates the need to do some basic transformation so that the timestamp format in the source files is standardized and thus recognizable by Redshift during the COPY operation.
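For example, Python's datetime module can parse the source format (day-first with an AM/PM marker, as in the sample above) and render it in the standard form:

```python
import datetime

raw = "30/09/2021 5:00:00 PM"
parsed = datetime.datetime.strptime(raw, "%d/%m/%Y %I:%M:%S %p")
print(parsed)  # 2021-09-30 17:00:00 -- a form Redshift's COPY can parse
```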
Let's look at example code of how you can achieve this task optimally in Python:
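Note: the original snippet did not carry over into this copy of the article, so the block below is a minimal reconstruction based on the line-by-line walkthrough that follows. The source file name, bucket and key are placeholders I've assumed, and the "Line #" references in the walkthrough refer to the author's original snippet, so they only roughly match the lines here.

```python
import csv
import datetime
import gzip
import io

import boto3

s3 = boto3.client("s3")  # for simplicity; creating it from a boto3.Session() is preferable

transformed_rows = []
mem_file = io.BytesIO()  # in-memory bytes buffer that will hold the gzipped CSV

# Read the source CSV and standardize the start_timestamp column
with open("sensor_data.csv", "r", newline="") as src:
    csv_rdr = csv.reader(src, delimiter=",")
    header = next(csv_rdr)  # first line is the header
    for x in csv_rdr:
        parsed_ts = datetime.datetime.strptime(x[2], "%d/%m/%Y %I:%M:%S %p")
        x[2] = str(parsed_ts)  # e.g. '30/09/2021 5:00:00 PM' -> '2021-09-30 17:00:00'
        transformed_rows.append(x)

transformed_rows_header = [header] + transformed_rows

# Write the transformed rows as CSV text into a string buffer,
# then gzip that text straight into the bytes buffer -- no disk I/O
with gzip.GzipFile(fileobj=mem_file, mode="wb", compresslevel=6) as gz:
    buff = io.StringIO()
    csv_wtr = csv.writer(buff)
    csv_wtr.writerows(transformed_rows_header)
    gz.write(buff.getvalue().encode("utf-8"))

mem_file.seek(0)  # rewind so the upload reads from the beginning
s3.put_object(
    Bucket="my-example-bucket",        # placeholder bucket
    Key="landing/sensor_data.csv.gz",  # placeholder prefix/key
    Body=mem_file,
)
```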
So a lot is happening in the above snippet. Let's break it down for better understanding:
- Line # 7: We create an S3 client via the boto3.client() method. It is suggested to use boto3.Session() and then create a boto3.client out of it (this article gives a good explanation). For the sake of simplicity, I've just used boto3.client().
- Line # 9: We create a binary stream using an in-memory bytes buffer. Think of it, in simple terms, as a way to write bytes to an in-memory file instead of an actual file on disk. This will be used to store the data of the compressed (gzipped) object which will be uploaded to S3.
- Line # 10 to 11: We open our source CSV file via Python's open() function using the "with" context-manager approach to make sure that the file is properly closed at the end, even in the case of exceptions. Then we use csv.reader() and pass it the opened file object. We also specify the delimiter of our CSV file.
- Line # 12: As our CSV file contains a header, we use Python's next() function to get the first item/line (which is the header in this case, and it will be of type list).
- Line # 13 to 15: This is where we are doing the actual processing, i.e. parsing the timestamp into a standard format. We iterate through the csv_rdr object, which yields a row on each iteration. We use datetime.datetime.strptime(x[2], "%d/%m/%Y %I:%M:%S %p") to parse the timestamp values in the third column (hence x[2]) as per the timestamp format (%d/%m/%Y %I:%M:%S %p), cast the result to a string and then store it. This gives us the timestamp values in the standard yyyy-mm-dd HH:MM:SS format.
- Line # 16: We append the transformed row to the transformed_rows list. Just a note that this is not an optimized approach for large files w.r.t. memory utilization. For efficient processing of large files, do consider using generators (see the sketch after this list).
- Line # 18: We create a new list, transformed_rows_header, which is a concatenation of the two lists, i.e. header and transformed_rows. Thus, transformed_rows_header is a list of lists: its first element is the header, and each subsequent element is a transformed row with the parsed timestamp format. At this stage, we have the data transformed to our desired state, but it is living within a Python object, i.e. a list.
- Line # 19: Similar to how we open a file, we initialize a context manager via gzip.GzipFile to specify that we want to write a gzip file. We pass the mem_file BytesIO buffer that we initialized earlier (line # 9) as our target, i.e. where the output of the gzip operation will be written. We specify 'wb' as the mode to indicate that we want to write bytes. For compresslevel, we use 6, which gives a good balance of speed and compression ratio. Thus, in a nutshell, we set where we want to write the output of our gzip file (in this case, to the in-memory buffer).
- Line # 20 to 22: We need to flush the transformed data (currently in the transformed_rows_header list of lists) into CSV format that can then be compressed and uploaded as a file. For this, we initialize an in-memory string buffer named buff (similar to the in-memory BytesIO buffer) and use the csv module to write the results from our list of lists, i.e. transformed_rows_header, to this in-memory file (buff), a.k.a. the string buffer. You can think of it as writing CSV data out to a file, but in this case we are performing an in-memory operation; we are not writing data out to disk, for the reasons discussed previously. Thus, at this stage, buff contains the transformed data in proper CSV format.
- Line # 23: We take the contents of the in-memory transformed CSV file (buff), compress them and write them to the in-memory bytes buffer (mem_file), taking care of the character encoding (i.e. UTF-8) as well. At the end of this, we have the transformed CSV data as bytes, in gzip-compressed form, in the mem_file bytes buffer.
- Line # 24: We bring the stream position back to the start of the in-memory buffer. Think of it as a cursor which is now pointing to the start of the file. This is done so that when we upload to S3, the whole file is read from the start.
- Line # 25: We use the s3.put_object() method to upload data to the specified bucket and prefix. In this case, for the Body parameter, we pass mem_file (the in-memory bytes buffer), which holds the compressed and transformed CSV data.
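As mentioned in the Line # 16 note, building the full transformed_rows list keeps every row in memory. One possible generator-based variant (a sketch with the same placeholder names, not the author's original code) streams rows straight into the gzip stream instead, so only the compressed bytes are buffered:

```python
import csv
import datetime
import gzip
import io

import boto3


def transformed_rows(path):
    """Yield the header and then each transformed row, one at a time."""
    with open(path, "r", newline="") as src:
        rdr = csv.reader(src)
        yield next(rdr)  # header
        for row in rdr:
            row[2] = str(datetime.datetime.strptime(row[2], "%d/%m/%Y %I:%M:%S %p"))
            yield row


mem_file = io.BytesIO()
with gzip.GzipFile(fileobj=mem_file, mode="wb", compresslevel=6) as gz:
    # TextIOWrapper lets csv.writer write text rows straight into the (binary) gzip stream
    with io.TextIOWrapper(gz, encoding="utf-8", newline="") as text_stream:
        csv.writer(text_stream).writerows(transformed_rows("sensor_data.csv"))

mem_file.seek(0)
boto3.client("s3").put_object(
    Bucket="my-example-bucket",        # placeholder
    Key="landing/sensor_data.csv.gz",  # placeholder
    Body=mem_file,
)
```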
And voilà! If you have taken care of the AWS side of things, e.g. you have an account, a bucket, an IAM user with the permissions to write to the bucket, and an AWS CLI profile configured (or a role if you are in AWS already), then it should read the file, transform it, compress it and upload it to the S3 bucket!
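If you want to sanity-check the result, a quick read-back (using the same placeholder bucket and key) could look like this:

```python
import gzip

import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-example-bucket", Key="landing/sensor_data.csv.gz")
csv_text = gzip.decompress(obj["Body"].read()).decode("utf-8")
print(csv_text.splitlines()[:3])  # header plus the first two transformed rows
```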
So that's pretty much it. The article demonstrates a very simple scenario with basic transformation logic. You can use the same logic with libraries like Pandas for slightly more advanced transformations as well. Happy Coding!
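For instance, a rough pandas-based equivalent of the same flow (a sketch with the same placeholder names, not the author's code) might look like this:

```python
import gzip
import io

import boto3
import pandas as pd

df = pd.read_csv("sensor_data.csv")
# Parse the day-first timestamp and re-format it to the standard yyyy-mm-dd HH:MM:SS form
df["start_timestamp"] = pd.to_datetime(
    df["start_timestamp"], format="%d/%m/%Y %I:%M:%S %p"
).dt.strftime("%Y-%m-%d %H:%M:%S")

buff = io.StringIO()
df.to_csv(buff, index=False)

boto3.client("s3").put_object(
    Bucket="my-example-bucket",
    Key="landing/sensor_data_pandas.csv.gz",
    Body=gzip.compress(buff.getvalue().encode("utf-8")),
)
```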
Source: https://levelup.gitconnected.com/efficiently-transforming-compressing-in-memory-and-ingesting-csv-files-to-aws-s3-using-python-da7bcec5f8f