Engine Works

The Real Guide to Fake Data

MeganBowers · 06-17-2024

You might need dummy data for various reasons—protecting data privacy, learning new software, creating portfolio projects, or asking for help on forums, to name a few.

You can mask sensitive data fields in Alteryx, but sometimes you may need a whole new dataset.

Luckily, there are many (free) websites and repositories for data that you can use in your projects. Keep reading to learn more about creating and accessing the data you need.

Generating Dummy Data

If you need to create a dataset using certain parameters or need placeholder data to start a project, generating a new dataset (aka “dummy data”) may be the best way to proceed. Here are a few options:

Mockaroo: With the free plan, you can generate 1,000 rows of dummy data per download. One cool feature of Mockaroo is that you can generate fields using AI: describe your topic, and it will generate field names and types for you!

Social media data schema generated by AI

The output format options are extensive as well, including Excel files.

Dataconstruct: Generate up to 1,000 rows of data by adding fields and selecting data types. There are many data types to choose from (e.g., Datetime, Countries, Currency codes, and Street Addresses).

The output formats are more limited but great for development projects.

Generatedata: You can generate up to 20 preview rows of data for free by selecting the fields you want and the output format. The site has a nice user interface and lots of output format options.

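If you would rather script a small dataset yourself, the same idea of picking fields and data types can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not how any of the sites above work. The field names (customer_id, signup_date, and so on), the country and currency lists, and the value ranges are all made-up assumptions, so swap in whatever fits your project.

```python
import csv
import io
import random
from datetime import date, timedelta

# Hypothetical fields and lookup values, chosen only for illustration.
FIELDS = ["customer_id", "signup_date", "country", "currency_code", "order_total"]
COUNTRIES = ["Canada", "Brazil", "France", "Japan", "Kenya"]
CURRENCIES = ["CAD", "BRL", "EUR", "JPY", "KES"]  # same order as COUNTRIES


def make_dummy_rows(n_rows, seed=42):
    """Return a list of n_rows dictionaries filled with random placeholder values."""
    rng = random.Random(seed)  # fixed seed so the same data is produced on every run
    start = date(2024, 1, 1)
    rows = []
    for i in range(1, n_rows + 1):
        idx = rng.randrange(len(COUNTRIES))  # keep country and currency consistent
        rows.append({
            "customer_id": i,
            "signup_date": (start + timedelta(days=rng.randrange(365))).isoformat(),
            "country": COUNTRIES[idx],
            "currency_code": CURRENCIES[idx],
            "order_total": round(rng.uniform(5, 500), 2),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text, header line first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = make_dummy_rows(1000)
csv_text = rows_to_csv(rows)
print(csv_text.splitlines()[0])  # header row
print(csv_text.splitlines()[1])  # first generated record
```

To save the result, write csv_text to a file of your choice and bring it into your workflow like any other CSV. Because the seed is fixed, anyone you share the script with gets the same rows, which can be handy when asking for help on forums.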

Dataset Repositories

Sometimes, you want an existing dataset that you can plug into your analysis right away. Much of the data on the sites below is real and can be helpful for portfolio projects, learning, and more.

Kaggle: You may have heard of Kaggle as it is commonly used in data education. Their datasets page is extensive; you can search by the analysis type you want to complete (e.g., classification, NLP, data visualization). Kaggle also gives each dataset a usability score and shows Python projects from other users who analyzed the dataset.

UC Irvine Machine Learning Repository: The place to go if you want to find images of 13,611 grains of 7 different registered dry beans.

In all seriousness, this site houses amazing data for machine learning projects. If you are upskilling in machine learning, you can find rich datasets with all kinds of feature types and subject areas.

Public Dataset Repository (GitHub): A testament to the power of open source, this massive list on GitHub contains links to public data in many industries. Instead of scouring the web for datasets, take a look at this list first! It is well maintained, with indicators for broken dataset links.

Google Dataset Search: Did you know that Google has a separate search engine for datasets? It pulls useful information into the search results so you can see a preview of the data, know when it was updated, and understand whether it is openly accessible.


Conclusion

Hopefully, these resources will be useful for your next data project. Whether you generate data for a use case to build a prototype solution or need data to experiment with machine learning models, there is plenty out there for the taking. Or rather, the downloading.

If you get your dummy data somewhere else, let us know in the comments!

BS_THE_ANALYST · 06-17-2024

Thanks @MeganBowers there's some gems in here.

I didn't know about Google Dataset Search either.

Very cool stuff 😎.

CailinS · 06-18-2024

This is such a great compilation of resources and examples. Thank you @MeganBowers !
