Hello,
I've been trying to figure this out for a little while and have hit a wall. I have two separate datasets that are still rather large even after filtering: one has 576k records and the other has just over 4 million. I'm trying to find the best way to "search" the 4 million records for rows whose phone_number column contains any of the phone_number values listed in the 576k. I tried a join on the phone_number column, but given the size of these datasets it was taking a very long time, and I'm not even positive that's the best way to do it. Have you done something similar and had success? If so, how did you do it? This is one of those things I've been looking at for a while, and I just need a second set of eyes and ideas.
Have you tried these things?
1) Keep only the phone column
2) Clean the phones, keeping just the digits
3) Filter out null and empty phones
4) Remove duplicates
If not, this will probably help; you likely have a lot of null, empty, or duplicated values in your dataset.
Another thing you could do is filter out phones that have fewer than N characters (you won't have a valid phone with fewer than 6 characters, for example, right?).
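In case it helps to see the steps above outside of a workflow tool, here is a minimal Python sketch of the same idea. It is only an illustration under some assumptions: the function names, the sample data, and the MIN_DIGITS value of 6 are made up for the example, and it treats a "match" as exact equality after cleaning (it does not handle things like country-code prefixes).

```python
import re

# Assumed threshold, taken from the "fewer than 6 characters" example above.
MIN_DIGITS = 6

def clean_phone(raw):
    """Keep only digits; return None for null, empty, or too-short values."""
    if raw is None:
        return None
    digits = re.sub(r"\D", "", str(raw))
    return digits if len(digits) >= MIN_DIGITS else None

def build_lookup(phones):
    """Clean the smaller list and collapse it into a set (this also removes duplicates)."""
    return {p for p in map(clean_phone, phones) if p is not None}

def find_matches(records, lookup, column="phone_number"):
    """Return records from the larger dataset whose cleaned phone is in the lookup set."""
    return [r for r in records if clean_phone(r.get(column)) in lookup]

# Hypothetical sample data standing in for the 576k and 4 million record sets.
small = ["(555) 123-4567", "555.123.4567", None, "", "12"]
large = [
    {"id": 1, "phone_number": "555-123-4567"},
    {"id": 2, "phone_number": "555 999 0000"},
    {"id": 3, "phone_number": None},
]

lookup = build_lookup(small)
matches = find_matches(large, lookup)
print(len(lookup), [r["id"] for r in matches])
```

The point of the sketch is that cleaning and de-duplicating the smaller list first shrinks it to only the distinct, usable phones, so the comparison against the larger dataset has far less to do and formatting differences no longer cause missed matches.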
Thank you Felipe! I added in those RegEx expressions as well as the Filters, and it is now doing what I wanted. It is also much, much faster.