The dataset is published on the Kaggle website.
Stemmed description tokens and application data have been collected for the 293,392 most popular applications. The dataset contains no application names; unique IDs identify the applications instead. Before tokenization, most of the descriptions were translated into English.
The dataset consists of four files:
First of all, let’s see how the data is distributed across operating systems.
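A minimal sketch of this count with pandas. The toy table and the column name `os` are assumptions made for illustration; they are not taken from the dataset files.

```python
import pandas as pd

# Toy stand-in for the application table; the column name "os" is hypothetical.
apps = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "os": ["android", "android", "ios", "android", "ios"],
})

# Absolute number of applications per operating system.
os_counts = apps["os"].value_counts()

# The same distribution expressed as shares that sum to 1.
os_share = apps["os"].value_counts(normalize=True)

print(os_counts)
print(os_share.round(2))
```

On the real data, the same two calls would be applied to whichever column stores the platform.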
Android apps dominate the data. Most likely, this is because more applications are being created for Android. Considering that the dataset contains only the most popular applications, it is interesting to see how the release date is distributed.
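One way to look at the release date distribution is to count releases per year. The sketch below assumes a hypothetical `release_date` column holding ISO-formatted date strings; the actual column name and format may differ.

```python
import pandas as pd

# Toy data; "release_date" is a hypothetical column name.
apps = pd.DataFrame({
    "release_date": ["2012-05-01", "2019-11-20", "2019-03-02", "2020-07-15"],
})

# Parse the strings into timestamps so the year can be extracted.
apps["release_date"] = pd.to_datetime(apps["release_date"])

# Number of applications released in each year, in chronological order.
releases_per_year = apps["release_date"].dt.year.value_counts().sort_index()

print(releases_per_year)
```

The same approach would work for the last update date.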
Most of the applications appear to be updated regularly, since their last update dates are not far in the past.
Basic data was collected over a short period of time in January.
It is interesting to see how the genres are distributed. Taking into account the OS imbalance, I will normalize the data for the histogram.
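The normalization can be sketched as follows: instead of raw counts, compute the share of each genre within each operating system, so that the larger Android sample does not dwarf iOS. The column names `os` and `genre` and the toy rows are assumptions.

```python
import pandas as pd

# Toy data; the column names "os" and "genre" are hypothetical.
apps = pd.DataFrame({
    "os": ["android", "android", "android", "android", "ios", "ios"],
    "genre": ["Games", "Games", "Tools", "Education", "Games", "Education"],
})

# Share of each genre within each OS: every column sums to 1,
# which makes the two platforms directly comparable.
genre_share = pd.crosstab(apps["genre"], apps["os"], normalize="columns")

print(genre_share)
```

A genre that exists on only one platform shows up as a zero in the other column, which makes non-overlapping genres easy to spot. If matplotlib is installed, `genre_share.plot.bar()` would draw the grouped histogram.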
We can see that genres do not completely overlap. This is especially noticeable in games. Is there something we can do about it? The most obvious thing is to reduce the number of genres for Android and bring them to the same form as for iOS. But I suppose that this is not the best option, since there will be a loss of information. Let’s try to solve the inverse task. To do this, I need to build a model that can predict genres for iOS applications.
I created some additional features using the description length and the number of tokens.
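A minimal sketch of such features, assuming the stemmed tokens are stored as a single whitespace-separated string in a hypothetical `description` column. The feature names `description_length` and `token_count` are illustrative as well.

```python
import pandas as pd

# Toy data; "description" is a hypothetical column of space-separated tokens.
apps = pd.DataFrame({
    "id": [1, 2],
    "description": ["fast photo editor filter", "learn english word"],
})

# Description length in characters.
apps["description_length"] = apps["description"].str.len()

# Number of tokens, obtained by splitting on whitespace.
apps["token_count"] = apps["description"].str.split().str.len()

print(apps[["id", "description_length", "token_count"]])
```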