Why and How to Use Pandas with Large Data

by: Alfred Williams, March 28, 2021

Pandas has long been a favorite data science tool in the Python programming language for data wrangling and analysis.

Data is messy in the real world. Pandas is a game-changer when it comes to cleaning, transforming and manipulating data. Put simply, Pandas helps to clear the clutter.

My Story of NumPy and Pandas

When I started learning Python, NumPy (Numerical Python) was my first introduction. It is the core package for scientific computing with Python. It provides a wealth of useful features for operations on n-dimensional arrays and matrices in Python.

The library also vectorizes mathematical operations on the NumPy array type, which speeds up execution considerably compared with element-by-element Python loops.
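To make the idea of vectorization concrete, here is a minimal sketch that compares a Python loop with a single vectorized expression. The array contents and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical sample data: 100,000 float values.
values = np.arange(100_000, dtype=np.float64)

# Loop version: one Python-level multiplication per element.
looped = np.empty_like(values)
for i in range(values.size):
    looped[i] = values[i] * 2.0

# Vectorized version: one expression over the whole array,
# executed in compiled code inside NumPy.
vectorized = values * 2.0

# Both approaches produce the same numbers.
same_result = np.array_equal(looped, vectorized)
print(same_result)
```

The vectorized line is both shorter and, on arrays of this size, typically much faster than the loop, although the exact speed-up depends on the machine.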

NumPy looks cool.

Still, I needed data analysis tools that work at a higher level. This is where Pandas came to my rescue.

The functionality of Pandas is built on top of NumPy, and both libraries are part of the SciPy stack. Pandas relies heavily on the NumPy array to implement its objects for manipulation and computation, but it exposes them in a more convenient manner.
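As a small illustration of that relationship, the sketch below shows a Pandas column handing its values over as a NumPy array. The column names and numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical columns and values, for illustration only.
df = pd.DataFrame({"price": [1.5, 2.5, 4.0], "quantity": [3, 1, 2]})

# The values of a numeric column can be exposed as a NumPy array.
price_array = df["price"].to_numpy()

# NumPy functions work on that array directly...
total_np = np.sum(price_array)

# ...while Pandas adds labels and higher-level methods on top.
total_pd = df["price"].sum()
revenue = (df["price"] * df["quantity"]).sum()
```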

In practice, NumPy and Pandas can still be used interchangeably. My preference for Pandas is based on its high-level features and ease of use.

Why Pandas with Large Data (Not Big Data)?

There is a huge difference between big and large data. The hype surrounding big data makes it easy to just follow the flow and consider everything “big data”.

A well-known joke from Prof. Dan Ariely

Large and big are both relative terms. In my humble opinion, large data refers to data sets of less than 100GB.

Pandas works well with small amounts of data, usually from 100MB to 1GB. Performance is seldom an issue.

If you work in data science, or in big data, there is a good chance that you will run into a common problem when using Pandas: low performance and long runtime, which ultimately end in running out of memory when dealing with large data sets.

Pandas has its limits when it comes to big data because of its algorithm and local memory constraints. Big data is therefore typically stored in computing clusters for higher scalability, fault tolerance, and reliability. It is often accessed through the big data ecosystem (EC2, Hadoop, etc.) using Spark and many other tools.

One way to make Pandas work with large data on local computers (with some memory limitations) is to decrease the memory consumption.
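Before reducing memory consumption, it helps to measure it. The following is a minimal sketch using DataFrame.memory_usage; the dataframe is a made-up stand-in for a real data set.

```python
import pandas as pd

# Hypothetical data standing in for a real data set.
df = pd.DataFrame({
    "user_id": range(1000),
    "country": ["MY", "SG", "US", "UK"] * 250,
})

# Bytes used by the index and by each column. deep=True measures the
# actual size of string columns instead of a shallow estimate.
per_column = df.memory_usage(deep=True)
total_bytes = int(per_column.sum())

print(per_column)
print(f"total: {total_bytes / 1024:.1f} KiB")
```

Running the same measurement before and after each of the steps below shows how much memory a given change actually saves.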

How to use Pandas when you have large data?

How can you reduce data memory usage using Pandas?

This explanation is based on my experience with a large anonymous data set (40-50GB). I had to reduce its memory usage so that it would fit into local memory for analysis (even before reading the data set into a dataframe).

1. Read CSV file data in chunk size

To be honest, I was baffled when I hit an error and couldn’t read the data from the CSV file. I later realized that my local machine, with 16GB of RAM, had too little memory for the data.

The good news is that pandas.read_csv has an option called chunksize.

This parameter sets the number of rows read into a dataframe at a time, so that each piece fits into local memory. Since the data has more than 70 million rows, I set the chunksize to 1,000,000 rows for each read, which broke the large data set down into smaller pieces.

Read CSV file data in chunksize
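The original code is not reproduced here, so the following is only a minimal sketch of what such a chunked read might look like. The in-memory CSV and its column names are hypothetical stand-ins for the real file, and the tiny chunksize is chosen only so the example runs on toy data.

```python
import io
import pandas as pd

# Stand-in for the large CSV file: a small in-memory CSV with
# hypothetical columns. With a real file, pass its path instead.
rows = "\n".join(f"{i},{i * 10}" for i in range(10))
csv_data = io.StringIO("id,value\n" + rows)

# chunksize=4 is only for this toy example; the article used 1,000,000.
df_chunk = pd.read_csv(csv_data, chunksize=4)

# df_chunk is an iterator that yields dataframes, not a dataframe itself.
print(type(df_chunk).__name__)
```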

Reading with chunksize results in a TextFileReader object that can be iterated over. df_chunk is not a dataframe, but an object to be used for further operations in the next step.

Once the object was ready, I used a basic workflow: perform the operations on each chunk, then concatenate the chunks into a dataframe at the end. While iterating over the chunks, I filtered and preprocessed each one with a function called chunk_preprocessing before appending it to a list. Finally, I combined the list into a final dataframe that fits into local memory.

Workflow applied to each chunk
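A minimal sketch of that workflow could look like the following. The names df_chunk and chunk_preprocessing come from the text above, but the body of chunk_preprocessing, the sample data, and the other variable names are assumptions made for illustration.

```python
import io
import pandas as pd

def chunk_preprocessing(chunk):
    """Hypothetical filter: keep only rows with a positive value."""
    return chunk[chunk["value"] > 0]

# Small in-memory CSV standing in for the real file.
csv_data = io.StringIO("id,value\n1,5\n2,-3\n3,8\n4,0\n5,2\n")
df_chunk = pd.read_csv(csv_data, chunksize=2)

# Filter each chunk as it is read, so only the rows that survive
# the filter are kept in memory.
chunk_list = []
for chunk in df_chunk:
    chunk_filter = chunk_preprocessing(chunk)
    chunk_list.append(chunk_filter)

# Combine the filtered chunks into one final dataframe.
df_concat = pd.concat(chunk_list, ignore_index=True)
print(df_concat)
```

Collecting the chunks in a list and calling pd.concat once at the end avoids repeatedly copying a growing dataframe inside the loop.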

2. To save memory, filter out irrelevant columns

Great. At this stage, the final dataframe was already assembled.

To make data manipulation and computation faster, I also removed unimportant columns in order to free up memory.

Filter out unimportant columns
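Here is a minimal sketch of two ways to do this, assuming a made-up dataframe in which only user_id and amount matter for the analysis. Dropping columns after loading and skipping them at read time with the usecols option of read_csv both work; the second avoids loading the unwanted columns in the first place.

```python
import io
import pandas as pd

# Hypothetical data: only "user_id" and "amount" are needed.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "amount": [9.5, 3.0, 7.25],
    "session_token": ["a1", "b2", "c3"],
    "debug_note": ["x", "y", "z"],
})

# Option 1: drop the unneeded columns from an existing dataframe.
df_small = df.drop(columns=["session_token", "debug_note"])

# Option 2: never load them, by naming the wanted columns at read time.
csv_data = io.StringIO(df.to_csv(index=False))
df_usecols = pd.read_csv(csv_data, usecols=["user_id", "amount"])

print(df.memory_usage(deep=True).sum(), df_small.memory_usage(deep=True).sum())
```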

3. Change the dtypes of columns

To convert a column of Pandas data to another type, the easiest way is to use astype().

Changing data types in Pandas is very helpful for saving memory, especially when you have large amounts of data that need intensive analysis or computation (for example, feeding data into a machine learning model for training).

By reducing the number of bits needed to store the data, I cut its overall memory usage by up to half.

Give it a try. I think you’ll find it useful as well. Let me know how it goes.

To save memory, you can change data types
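As a minimal sketch of the idea, the example below downcasts 64-bit columns to 32-bit ones with astype() and compares memory usage before and after. The dataframe and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two 64-bit columns of 1,000 rows each.
df = pd.DataFrame({
    "count": np.arange(1000, dtype=np.int64),
    "ratio": np.linspace(0.0, 1.0, 1000),  # float64 by default
})
before = df.memory_usage(index=False).sum()

# Downcast to 32-bit types. This is only safe when the values fit the
# smaller integer range and reduced float precision is acceptable.
df_small = df.astype({"count": np.int32, "ratio": np.float32})
after = df_small.memory_usage(index=False).sum()

print(before, after)
```

In this toy case the 32-bit version needs half the bytes of the 64-bit one. Note that float32 carries less precision than float64, so check that the analysis tolerates it before downcasting.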

Last Thoughts

There you have it. Thank you for reading.

I hope that sharing my experience of using Pandas with large data helps you explore another useful side of Pandas: handling large data by reducing memory usage, which in turn improves computational efficiency.

Pandas typically has most of the features we need for data wrangling and analysis. I encourage you to look them up, as they will be useful the next time you work with data.

If you are serious about learning data analysis in Python, this book is for you: Python Data Analysis. The book provides detailed instructions on how to manipulate, clean, and crunch data in Python using Pandas.

