I had to manipulate a large amount of text data at work, and came across two neat tricks I’d like to remember.
If I have a Product A that’s always bundled with Product B, but I also sell Product B as a separate item, how many users do I have that have only purchased Product B? It’s not as easy as querying all Product B purchasers, because I’ll pick up purchasers of Product A as well.
So, I made a list of all Product A purchasers, and a list of all Product B purchasers, and made a final list of Product B purchasers that didn’t show up on the Product A purchasers list.
    # AccountsWithProductA is a list of all purchasers of Product A
    # AccountsWithProductB is a list of all purchasers of Product B
    # set() is built in, so no import is needed (the old sets module is gone in Python 3)
    SetOfProductAUsers = set(AccountsWithProductA)
    SetOfProductBUsers = set(AccountsWithProductB)
    # Remove every Product A purchaser from the Product B set, in place
    SetOfProductBUsers -= SetOfProductAUsers
After those operations, SetOfProductBUsers only contains exclusive Product B users. It’s a handy manipulation that’s tough to do with lists alone.
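To see why the set version is nicer than a list-only one, here is a small side-by-side sketch. The account names and the lowercase variable names are made up for illustration; they are not from my real data.

```python
# Hypothetical sample data; the real lists came from parsing purchase records.
accounts_with_product_a = ["alice", "bob", "carol"]
accounts_with_product_b = ["alice", "bob", "carol", "dave", "erin", "dave"]

# List-only version: every "not in" check scans the whole Product A list,
# and repeat purchasers stay in the result.
only_b_list = [acct for acct in accounts_with_product_b
               if acct not in accounts_with_product_a]

# Set version: membership checks are hash lookups, and duplicates collapse.
only_b_set = set(accounts_with_product_b) - set(accounts_with_product_a)

print(only_b_list)
print(sorted(only_b_set))
```

The `-` operator builds a new set and leaves both inputs alone, while `-=` modifies the left-hand set in place.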
Another problem I faced was that the initial parse of the data to extract accounts filled the list with duplicates, since many accounts purchase the products again from time to time. I did some Googling to track down a way to prune duplicates from a list and found this handy StackOverflow post.
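For reference, one common way to drop duplicates while keeping the first-seen order looks roughly like this. It is a sketch of a widely used pattern, not necessarily the exact code from that post, and the function name and sample accounts are my own placeholders.

```python
def dedupe_keep_order(accounts):
    """Return the accounts with duplicates removed, keeping first-seen order."""
    seen = set()    # accounts already encountered
    unique = []     # result, in original order
    for acct in accounts:
        if acct not in seen:
            seen.add(acct)
            unique.append(acct)
    return unique

# Hypothetical parsed data with repeat purchasers.
parsed_accounts = ["acct-1", "acct-2", "acct-1", "acct-3", "acct-2"]
print(dedupe_keep_order(parsed_accounts))
```

If the order doesn’t matter, list(set(parsed_accounts)) does the same job in one line.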
I took the naive approach to both tasks (removing duplicates and pruning members of a list that also appear in another list) and didn’t have much luck. What I was doing worked with my test data, but when I threw the gigantic real data set at it, I ran out of memory. There’s a lot about Python internals I don’t understand!