India car database changelog

Changelog for https://www.teoalida.com/cardatabase/india/

In 2015 about 8% of the website's traffic came from India, and every year dozens of Indians ask me whether I can make an automobile database for their country. For a long time I stayed away from doing this, for 3 reasons:
– Lack of a reliable data source comparable to the AutoKatalog magazine (Europe) and the Edmunds.com website (America).
– The poverty of a third-world country (some customers offered me sums of money so small that they made me want to cry).
– Bad experiences with some Indians: many people ask for what they need but only a small fraction of them are actually willing to pay, which makes the market size difficult to estimate and the sales potential easy to overestimate. Some also had an aggressive attitude, expecting me to do the work for free and pay afterwards only “if they like the work”, or asked me to do unwanted work outside my field of experience. All this left me undecided whether I should serve anyone in India.

Copy-pasting data from websites would have required dozens of hours, too much effort for the offers received. Alternatively, I could have paid a programmer to scrape data automatically from a website, but with the small sums of money offered by potential customers I would have needed at least 10-20 sales to cover the cost (freelance programmers charge a few hundred dollars for scraping services).

The decision to build an Indian car database came in August 2015. After finishing a new phase in the development of the European databases, which kept me busy between 25 July and 15 August, I got some free time and learned to scrape data from websites myself. This significantly reduces the time investment needed to create the India car database, bringing the project to the line of profit. By coincidence, the Carwale website was redesigned between 14 and 18 August 2015, according to www.archive.org.

In just one week I was contacted by 3 new potential customers asking for a list of cars in India. 2 of them insisted that I scrape data from a website, and one was against automated scraping. If you’re against scraping, please suggest an alternative way to get the data!

Note about updates inconsistency

Between 2015 and 2017 I ran the scraper on make pages to get model URLs, then on every model URL to get version URLs, then on every version URL to get specifications. I then removed all data from the previous update and put in the new data. All cars got updated (including prices) to the current month.
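A minimal sketch of that three-level crawl (make pages, then model pages, then version pages). It is only an illustration, not the scraper actually used: `crawl_version_urls`, `get_links`, `FAKE_SITE` and the `example.com` URLs are assumed names, and `get_links` stands in for whatever fetches a page and extracts its links.

```python
from typing import Callable, Dict, Iterable, List


def crawl_version_urls(make_urls: Iterable[str],
                       get_links: Callable[[str], List[str]]) -> List[str]:
    """Walk make pages -> model pages -> version pages; return version URLs."""
    version_urls: List[str] = []
    seen = set()
    for make_url in make_urls:
        for model_url in get_links(make_url):
            for version_url in get_links(model_url):
                if version_url not in seen:  # keep each version only once
                    seen.add(version_url)
                    version_urls.append(version_url)
    return version_urls


# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE: Dict[str, List[str]] = {
    "https://example.com/make-a/": ["https://example.com/make-a/model-1/"],
    "https://example.com/make-a/model-1/": [
        "https://example.com/make-a/model-1/base/",
        "https://example.com/make-a/model-1/top/",
    ],
}

found = crawl_version_urls(["https://example.com/make-a/"],
                           lambda url: FAKE_SITE.get(url, []))
print(found)
```

Each version URL returned this way would then be fetched once more to read the specifications.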

In February 2017 the Carwale website removed (hid) the URLs leading to discontinued models, so my database contains valuable data that you cannot get yourself from Carwale anymore. I kept updating the database by getting the version URLs of new cars only, adding those URLs to the existing data, comparing the unique ID number from each URL, deleting duplicates, then scraping all version URLs (new and discontinued) to get specifications, including the current price and the last recorded price for discontinued cars.
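The ID comparison can be sketched like this. The exact URL layout is an assumption: the sketch supposes the ID is the run of digits at the very end of the URL, and `url_id`, `merge_by_id` and the `example.com` URLs are hypothetical names used only for illustration.

```python
import re
from typing import Dict, Iterable, List, Optional

# Assumption: the unique ID is the run of digits at the very end of the URL.
TRAILING_ID = re.compile(r"(\d+)/?$")


def url_id(url: str) -> Optional[int]:
    """Return the trailing numeric ID of a URL, or None if there is none."""
    match = TRAILING_ID.search(url)
    return int(match.group(1)) if match else None


def merge_by_id(existing: Iterable[str], scraped: Iterable[str]) -> List[str]:
    """Combine both URL lists, keeping one URL per unique ID."""
    by_id: Dict[int, str] = {}
    for url in list(existing) + list(scraped):
        car_id = url_id(url)
        if car_id is None:
            raise ValueError(f"no trailing ID in {url!r}")
        by_id.setdefault(car_id, url)  # the first occurrence wins
    return list(by_id.values())


existing = [
    "https://example.com/make/model/vxi-1001/",
    "https://example.com/make/model/vxi-1002/",  # same name, different car
]
scraped = [
    "https://example.com/make/model/vxi-1002/",  # already known
    "https://example.com/make/model/zxi-1003/",  # new this month
]
merged = merge_by_id(existing, scraped)
print(len(merged))
```

Because the key is the ID and not the name, the two cars called "vxi" stay separate while the repeated URL is dropped.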

In November 2017 Carwale removed the unique ID from each URL, which was the ONLY way to distinguish multiple cars with exactly the same name. All car URLs were changed and redirected to new URLs without the ID number at the end. In 10 cases the old version URL redirects to 404 Not Found, and in 197 cases the old version URL redirects to the wrong car (multiple old URLs redirect to the same new URL because of identical model names). This makes it impossible for me to re-scrape old cars for updates without risking the loss of model versions.
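One way to spot such collisions, assuming the redirect target of every old URL has already been recorded, is to group the old URLs by their new URL. The function name `redirect_collisions`, the use of `None` for a 404 and the `example.com` URLs are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def redirect_collisions(
        redirects: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Return new URLs that are the redirect target of several old URLs."""
    by_target = defaultdict(list)
    for old_url, new_url in redirects.items():
        if new_url is not None:  # None marks an old URL that now gives 404
            by_target[new_url].append(old_url)
    return {new: olds for new, olds in by_target.items() if len(olds) > 1}


# Hypothetical data: two old cars with the same name now share one URL.
redirects = {
    "https://example.com/make/model/vxi-1001/": "https://example.com/make/model/vxi/",
    "https://example.com/make/model/vxi-1002/": "https://example.com/make/model/vxi/",
    "https://example.com/make/model/zxi-1003/": "https://example.com/make/model/zxi/",
    "https://example.com/make/model/old-1004/": None,
}
collisions = redirect_collisions(redirects)
not_found = [old for old, new in redirects.items() if new is None]
print(collisions)
print(not_found)
```

Every entry in `collisions` is a case where re-scraping the new URL would return data for only one of the old cars.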

The only way to update the database is to run the scraper on new cars only, add the data to the New & Old cars database, then use an Excel formula to identify duplicate URLs and delete them. I assume that the remaining URLs are cars launched in the last month, and I add them at the bottom of the database. I add new cars each month, but I cannot update the data of older cars anymore (for example the price, which changes often). This is not 100% reliable: if Carwale changes or corrects a model name, the URL changes too and I will add the car to the database as a new model; and if a model is discontinued and replaced by a new model with the same name and URL, the new one will not be included.
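The same duplicate check can be written in Python instead of an Excel formula. This is a sketch under assumptions, not the actual workflow: `new_urls` and the `example.com` URLs are made-up names, and the example also shows the weakness described above, where a corrected model name looks like a new car.

```python
from typing import Iterable, List


def new_urls(known: Iterable[str], scraped: Iterable[str]) -> List[str]:
    """Return scraped URLs that are not yet in the database, in order."""
    known_set = set(known)
    result: List[str] = []
    for url in scraped:
        if url not in known_set:
            known_set.add(url)  # also drops repeats within `scraped`
            result.append(url)
    return result


known = [
    "https://example.com/make/model-x/base/",
    "https://example.com/make/model-y/base/",
]
scraped = [
    "https://example.com/make/model-y/base/",            # already known
    "https://example.com/make/model-x-corrected/base/",  # renamed, same car
    "https://example.com/make/model-z/base/",            # really new
]
to_add = new_urls(known, scraped)
print(to_add)
```

Both the renamed and the really new URL are reported, because a URL comparison alone cannot tell them apart.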

In 2019 Carwale chose to concatenate multiple specifications into a single field (such as cubic centimetres, cylinders, valves and camshaft), causing inconsistencies in my database between old and new cars. Since old cars no longer show on the Carwale website, I cannot scrape the data again for ALL cars as I did in 2015-2017, so my database’s quality is at risk if Carwale continues to make changes to its website (if you purchase the “new cars only” database, don’t worry, it is consistent).
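A combined field of this kind could in principle be split back into separate columns with regular expressions. The sketch below is only an illustration: the sample text `"1197 cc, 4 Cylinders, 4 Valves/Cylinder, DOHC"` is an invented format, not Carwale's real one, and `split_engine_field` and its column names are hypothetical.

```python
import re
from typing import Dict, Optional


def split_engine_field(text: str) -> Dict[str, Optional[str]]:
    """Split a combined engine description into separate values."""

    def first(pattern: str) -> Optional[str]:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        return match.group(1) if match else None

    return {
        "displacement_cc": first(r"(\d+)\s*cc"),
        "cylinders": first(r"(\d+)\s*cylinders?\b"),
        "valves_per_cylinder": first(r"(\d+)\s*valves?"),
        "camshaft": first(r"\b(DOHC|SOHC)\b"),
    }


# Invented sample value; the real field layout is not documented here.
sample = "1197 cc, 4 Cylinders, 4 Valves/Cylinder, DOHC"
print(split_engine_field(sample))
```

Values that are missing from the text come back as `None`, so incomplete rows stay visible instead of being filled with guesses.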

In April 2019 I made a new scraper for Bikewale, adding individual versions to the Indian bikes database (in previous editions, if a bike had multiple versions, the database contained only the base version).

Carwale re-added (some of) the discontinued models in October 2020. I emailed update notifications to 30+ customers (subscribers) asking whether I should continue adding new cars to the existing database each month, bearing the risk of inconsistencies, OR start a new database containing the new and discontinued cars that are shown on Carwale (2000 cars fewer) with consistent data in each column and without duplicates. Only 2 replied, both choosing option 2. Another 2 new customers also chose option 2. So in December 2020 I made a new database format, solving the problems of inconsistency.

Cars list of updates

~500 models, 2855 versions (998 in production), ~1000 KB – 25 August 2015 (25 columns). 3 eurocents/model = 85.65 euro.
Dimensions database: 3 eurocents/model / dimensions (6 columns).

~500 models, 2904 versions (partial update), 5393 KB – 28 October 2015 (176 columns).

528 models, 3121 versions (1044 in production), 5969 KB – 22 December 2015 (178 columns), 4 eurocents/model = 124.84 euro.
After the 3rd sale I re-launched the database in 4 formats: Make & Model (310 models), Dimensions (515 models), Basic Specs (2 eurocents/model), Full Specs & Features (4 eurocents/model).

549 models, 3214 versions, unreleased – 1 February 2016.

557 models, 3254 versions, unreleased – 1 March 2016.

547 models, 3290 versions (1078 in production), 6474 KB – 14 March 2016 (179 columns), 4 eurocents/model = 131.60 euro.

561 models, 3303 versions, unreleased – 1 April 2016.

564 models, 3354 versions, unreleased – 1 May 2016.

569 models, 3404 versions (1045 in production), 6556 KB – 1 June 2016.
Price capped at 120 euro, no further increases.
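The prices listed so far match the number of versions multiplied by the per-unit rate (for example 2855 × 3 eurocents = 85.65 euro), with the 120 euro cap applied from June 2016. A small sketch of that arithmetic follows; `database_price` and its parameters are names invented for the example, and applying the 4 eurocent rate to the June 2016 entry is an assumption.

```python
from decimal import Decimal
from typing import Optional


def database_price(versions: int, eurocents_per_version: int,
                   cap_euro: Optional[int] = None) -> Decimal:
    """Price in euro: versions times rate, optionally limited by a cap."""
    price = Decimal(versions) * Decimal(eurocents_per_version) / Decimal(100)
    if cap_euro is not None:
        price = min(price, Decimal(cap_euro))
    return price.quantize(Decimal("0.01"))


print(database_price(2855, 3))                # 25 August 2015
print(database_price(3121, 4))                # 22 December 2015
print(database_price(3290, 4))                # 14 March 2016
print(database_price(3404, 4, cap_euro=120))  # 1 June 2016, capped
```

`Decimal` is used instead of floating point so the cent amounts come out exact.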

574 models, 3432 versions (1098 in production), 6877 KB – 1 July 2016.
Make & Model 319 models, Dimensions 472 models. Added ID column.

576 models, 3469 versions (1082 in production), 7312 KB – 1 August 2016 (183 columns).
Basic specs and No specs now also include prices. The Status (production/discontinued) column was removed because prices do this job.

579 models, 3509 versions (1101 in production), 7414 KB – 1 September 2016.

582 models, 3555 versions (1100 in production), 7125 KB (except colors) – 1 October 2016.

596 models, 3615 versions (1118 in production), 7634 KB – 1 November 2016 (186 columns).
Status column added back, prices removed from No specs. Image URL added. One customer told me to scrape a used cars website, and this way I found additional discontinued makes that were missing from my database: Mahindra-Renault (Logan and Sandero models that aren’t listed under either Mahindra or Renault), also Chrysler, Maini, Maybach and Willys, in total 9 models, 50 versions.

605 models, 3661 versions (1115 in production), 7769 KB – 1 December 2016.

610 models, 3680 versions (1128 in production), 7824 KB – 1 January 2017 (188 columns).
Added car class and body style. These two columns were added in October as a custom package for a specific customer, and are now offered to all customers.
No specs (5 columns) 505 KB, Basic specs (26 columns) 1357 KB.
Dimensions 535 models (10 columns) 169 KB, added car class and body style.
Make & Model 353 models (4 columns) 76 KB, added status and car class.

613 models, 3725 versions (1165 in production), 8002 KB – 1 February 2017.

619 models, 3795 versions (1147 in production), 8084 KB – 4 March 2017.

3843 versions (1155 in production), 8505 KB – 1 April 2017.

3945 versions (1122 in production), 9004 KB – 1 May 2017. URL column added.

3955 versions (1145 in production), 9315 KB – 2 June 2017. Some changes in the source website caused file size increase.

3980 versions (1140 in production), 9108 KB – 1 July 2017.

4016 versions (1144 in production), 9236 KB – 1 August 2017.

4062 versions (1168 in production), 9605 KB – 1 September 2017.

4097 versions (1179 in production), 9670 KB – 1 October 2017.

4144 versions (1192 in production), 10251 KB – 1 November 2017.

4181 versions (1161 in production), 10397 KB – 1 December 2017.

4218 versions (1180 in production), 11371 KB – 1 January 2018.

4246 versions (1134 in production), ???? KB – 1 February 2018.

4274 versions (1127 in production), 11544 KB – 1 March 2018.

4296 versions (1125 in production), 11596 KB – 1 April 2018.

4332 versions (1146 in production), 11702 KB – 1 May 2018.

4372 versions (1141 in production), 11832 KB – 1 June 2018.

4401 versions (1139 in production), 11884 KB – 1 July 2018.

4418 versions (1135 in production), 11939 KB – 1 August 2018.

4480 versions (1145 in production), ? KB – 1 September 2018.

4511 versions (1160 in production), ? KB – 1 October 2018.

4562 versions (1168 in production), ? KB – 1 November 2018.

4616 versions (1176 in production), ? KB – 1 December 2018.

4639 versions (1180 in production), 12663 KB – 1 January 2019.

4683 versions (1178 in production), ? KB – 1 February 2019.

4721 versions (1177 in production), ? KB – 1 March 2019.

4756 versions (1168 in production), 12982 KB – 1 April 2019.

4803 versions (1158 in production), 13137 KB – 1 May 2019.

4850 versions (1171 in production), 13271 KB – 1 June 2019.

4900 versions (1214 in production), 13431 KB – 1 July 2019.

4929 versions (1172 in production), ? KB – 1 August 2019.

4983 versions (1200 in production), ? KB – 1 September 2019.

5038 versions (1225 in production), 13782 KB – 12 October 2019.

5046 versions (1227 in production), 13782 KB – 1 November 2019.

5066 versions (1226 in production), 13808 KB – 3 December 2019.

1 Jan 2020: 5093 versions, new cars only: 288 models, 1247 versions, 14109 KB, added 6 more columns.

1 Feb 2020: 5155 versions, new cars only: 292 models, 1236 versions, 14259 KB.

7 Mar 2020: 5236 versions, new cars only: 294 models, 1231 versions, ? KB.

1 Apr 2020: 5298 versions, new cars only: 289 models, 1136 versions, 14639 KB.

1 May 2020: 5364 versions, new cars only: 280 models, of which 208 models have 996 versions. The adoption of BS6 norms on 1 April ended production of numerous models, so the version count dropped.

1 June 2020: 5398 versions, new cars only: 277 models, of which 201 models have 940 versions.

1 July 2020: 5436 versions, new cars only: 276 models, of which 193 models have 933 versions.

1 August 2020: 5? versions, new cars only: 248 models, of which 171 models have 833 versions.

In August 2020 the Carwale website was redesigned, and I spent a few hours editing the scraper's XPath expressions. The Carwale page source no longer includes Make and Model separated from Version, so the only place to get this information was the URL, which does not have correct capitalization.
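Recovering names from a URL could look like the sketch below. The URL layout `/<make>-cars/<model>/<version>/` is an assumption, and `names_from_url`, `prettify`, the `OVERRIDES` table and the `example.com` URLs are hypothetical. It also shows why the capitalization problem remains: a version such as "lxi" becomes "Lxi", whatever its proper spelling is.

```python
from typing import Dict, Tuple
from urllib.parse import urlparse

# Hypothetical overrides for names that plain capitalization gets wrong.
OVERRIDES: Dict[str, str] = {"bmw": "BMW", "mg": "MG"}


def prettify(slug: str) -> str:
    """Turn a lowercase URL slug such as 'maruti-suzuki' into a name."""
    return " ".join(OVERRIDES.get(word, word.capitalize())
                    for word in slug.split("-"))


def names_from_url(url: str) -> Tuple[str, str, str]:
    """Return (make, model, version) guessed from the URL path."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) != 3:
        raise ValueError(f"unexpected URL layout: {url!r}")
    make_slug, model_slug, version_slug = parts
    if make_slug.endswith("-cars"):  # assumed suffix on the make segment
        make_slug = make_slug[: -len("-cars")]
    return prettify(make_slug), prettify(model_slug), prettify(version_slug)


print(names_from_url("https://example.com/maruti-suzuki-cars/swift/lxi/"))
print(names_from_url("https://example.com/bmw-cars/x1/sdrive20d/"))
```

An override table fixes known acronyms, but every name not listed in it still depends on a guess.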

In October 2020 Carwale re-added discontinued models, allowing me to re-scrape ALL cars and not just the ones currently in production, but this resulted in 3317 model versions. I emailed update notifications to 30+ customers asking whether I should continue adding new cars to the existing database each month, bearing the risk of the inconsistencies and duplicates described above, plus possibly even more inconsistencies in the future if Carwale redesigns its website again, OR start a new database containing new and discontinued cars, ONLY those currently shown on Carwale (2000 cars fewer), with consistent data in each column and without duplicates.

Of the 30+ past customers emailed, only 2 replied, both choosing option 2. Another 2 new customers also chose option 2. So in December 2020 I redesigned the database according to customer preference and to the current design of Carwale.

January 2021 had 1301 production cars and 2441 discontinued cars showing on Carwale, and I added 2570 cars from the 2020 database that do not currently exist on Carwale.

March 2021 has 1074 production cars and 2512 discontinued cars, and I added 2806 cars from the 2020 database that do not currently exist on Carwale (I copy-pasted all URLs from the 2020 database into the current database and deleted duplicates). The huge variations in the number of cars indicate that Carwale is changing URLs over time, and because Carwale has not shown a unique ID for each car since 2017, offering you a complete and duplicate-free database has become HELL for me.

1061 versions in production, 3609 versions including discontinued, 6748 versions including deleted – 22 May 2021

1072 versions in production, 3497 versions including discontinued, 7797 versions including deleted – 1 August 2021

1101 versions in production, 3576 versions including discontinued, 7846 versions including deleted – 1 September 2021

Bikes list of updates

24 makes, 247 models – January 2016 (initial release).

25 makes, 271 models – October 2016.

Sometime in 2017 I discovered that Bikewale has links to discontinued models, which increased the model count to over 600. I did 2 more updates in 2017 but did not track them.

27 makes, 647 models, specs for 426 models – April 2018.

36 makes, 782 models, 1214 versions, specs available for 927 models – 4 April 2019.

41 makes, 845 models, 1219 versions – 13 October 2019.

41 makes, 857 models, 1271 versions – 29 January 2020.

41 makes, 866 models, 1301 versions – 1 May 2020.

41 makes, 937 models, 1386 versions – 2 August 2020.

43 makes, 947 models, 1491 versions – 20 March 2021.

49 makes, 978 models, 1569 versions – 26 May 2021 (temporary, not published).

49 makes, 979 models, 1570 versions – 4 June 2021.

50 makes, 1006 models, 1619 versions – 1 September 2021.
