Tell me WHICH SPECS you are looking for, and I will tell you if it is possible! Excel sample available on request.
My personal idea is 14 technical specs + year, make, model, trim, body… but I need to do a survey to make sure that is enough for American tastes before wasting time developing a database that nobody wants (July 2013). Here is the SAMPLE made in July 2013 and never published on the website (except as a blurred screenshot), inviting people to contact me.
I gave the Excel file ONLY to the fewer than 10 people who contacted me; we discussed it, which helped me make the correct decisions.
FINAL sample file published in February 2014, full database available for sale in May 2014.
Original database made in 2014 that I was not able to update
All European databases, plus the American Year-Make-Model (without specifications), are compiled and updated manually by myself (1-person work, since 2003), with data taken from books and typed into Excel, etc.
But this American Year-Make-Model-Trim-Specs is a 2-person work: for the first time I “cheated” and crawled a website, that is, scraped data automatically with a program.
More exactly, I found that one of my customers was also a programmer. We agreed to exchange a free database (I refunded his purchase) for his help making a crawler for free and crawling a website a few times per year so I could provide regular updates. The job turned out HARDER than he expected: on the first and second attempts it crawled only the first trim of each model, and the programmer had to rescript it 3 times to find every trim. After 20 hours of work in C# on weekends, he finished in April and gave me raw CSV data. Then I, as a car expert, spent just a few hours enhancing the data visually in Excel: I deleted duplicates (cars accidentally crawled twice), fixed a few fatal errors, filled in some of the missing data, and launched the database for sale in May 2014.
The final data sheet had 41425 rows and 28 columns.
Note: at the 2015 recheck I found that each duplicate indicated one or more missing vehicles. About 1000 vehicles were missing, so even on the 3rd attempt the crawler wasn’t finding EVERY vehicle.
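Since each duplicate turned out to point at a missing vehicle, it can help to list the duplicates before deleting them. Below is a minimal Python sketch of that idea, not the original C# crawler or the Excel workflow. The column names (year, make, model, trim) and the sample rows are assumptions made up for illustration.

```python
from collections import Counter

KEY_FIELDS = ("year", "make", "model", "trim")  # assumed column names

def find_duplicates(rows, key_fields=KEY_FIELDS):
    """Return {key: count} for every key that appears more than once."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

def drop_duplicates(rows, key_fields=KEY_FIELDS):
    """Keep only the first row seen for each key, preserving row order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Made-up sample rows: the third row repeats the first one.
rows = [
    {"year": "2014", "make": "Ford", "model": "Focus", "trim": "S"},
    {"year": "2014", "make": "Ford", "model": "Focus", "trim": "SE"},
    {"year": "2014", "make": "Ford", "model": "Focus", "trim": "S"},
]

duplicates = find_duplicates(rows)   # keys worth rechecking on the source site
unique_rows = drop_duplicates(rows)  # cleaned list for the data sheet
```

Keeping the list of duplicate keys, instead of only deleting the extra rows, gives a ready-made checklist of models to recheck for missing trims.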
After making a few sales over the next months, I asked him to crawl again so I could update the database, but shit happened: the source website changed its architecture, so the crawler was no longer able to find any car. It needed to be rescripted; meanwhile the programmer had started his own company and no longer had time to re-re-rescript my crawler every time I wanted an update.
What can I do now? Hire a dedicated programming company? I tried, but they are interested in MONEY, not in doing a perfect job. They charge $1000+ (more than I earned from sales of this particular database in the whole of 2014) with full payment in advance, I have no guarantee that they will do a good job on the first attempt, and they will probably charge me additional money to re-re-rescript the crawler (as happened with my programmer).
Instead of looking for a programming company and risking my money, I am looking to make a new deal of mutual benefit with a customer who has programming skills and is personally interested in car data, and who can script a crawler and modify it until it runs perfectly, without additional charges.
You code a crawler and give me the in-page raw data, or just a list of URLs, for FREE, and I will give an Excel database back to you in about 1 week for FREE: enhanced, duplicates removed, errors corrected (with my car experience), car dimensions filled in, etc. All this at the cost of my time, FREE of charge for you!
Some people told me from the start that crawling that website was too complex for their skills. From January to July 2015, 3 potential customers agreed to the mutual deal and I gave them details of what to crawl. Two stopped replying; the 3rd, whom I agreed to PAY some money besides giving the free database, gave up after 2 weeks due to the complexity. FAIL!
Remaking the database in November 2015… and offering regular updates
After failing to find a good programmer to help me scrape a website in exchange for a FREE database, in August 2015 I found www.import.io, which allows me to scrape data myself. It extracts data from a specific list of pages only; import.io cannot crawl all pages of a website automatically. But this makes me able to update the database myself, without depending on third-party programmers, in a semi-manual, semi-automatic way.
I spent 50 hours of manual work visiting 5000 pages and copy-pasting their URLs (2015-2016 models, as well as old models with duplicate cars, each of which indicates a missing car), then let import.io run in the background for 40 hours (in batches) to automatically extract data from each URL, then made manual improvements in Excel.
I was busy with The BIG Car Database for Europe until 17 October, so the actual update for America started on 19 October and finished on 10 November (with a break on 20-28 October).
46813 model trims – November 2015.
Offered in 4 versions: trims or full specs, whole database or new cars only (unlike the 2014 database, which had a single price option). The dimensions database is yet to be filtered and launched.
47860 model trims – May 2016.
Sales were slow at the start, but they have been rising since March, so I did the first update in May.
48864 model trims – September 2016 (except Ford F-Series and RAM)
I decided to do a new update after 4 months, and when I had almost finished adding the URLs of 1000+ additional models, import.io suspended my account and limited new sign-ups to scraping 500 pages per month, leaving me unable to add data for the newly added cars. My programmer partner from Pakistan started developing his own universal scraping software, similar to import.io, which took over 1 month of coding in Visual Basic until it ran properly.
40978 model trims – November 2016 (added F-Series and RAM)
I released the September update, doing the scraping with my partner’s software.
50468 model trims – 22 March 2017
After doing the main yearly update for the Car Models & Engines Database for Europe in February, I updated the American database between 10 and 22 March. I am sorry that it took so long to release a new update.
50772 model trims – 26 May 2017
Partial update just for new cars. I made more scraping scripts that show me where 2018 models have been launched: instead of checking all 2017 models from A to Z to see whether a 2018 model had been launched, I checked only the models that were actually launched. The time required to add the URLs of new models decreased from 30 to 5 hours, allowing me to update more often!
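The idea of checking only the models that actually have a new year can be sketched in Python as follows. This is an illustration under assumptions, not the actual scraping scripts. It assumes the model years available on the site have already been collected into a dictionary, and the sample values are made up. The URL pattern is the make/model/year form of the Edmunds address quoted elsewhere on this page.

```python
def models_with_new_year(latest_in_db, available_years):
    """Return {(make, model): [years]} for years newer than the database has."""
    updates = {}
    for model, years in available_years.items():
        newest_known = latest_in_db.get(model)
        newer = sorted(
            y for y in years if newest_known is None or y > newest_known
        )
        if newer:
            updates[model] = newer
    return updates

def model_year_url(make, model, year):
    """Build a model-year page URL (assumed make/model/year pattern)."""
    return f"https://www.edmunds.com/{make}/{model}/{year}/"

# Made-up sample data: newest year in the database vs. years seen on the site.
latest_in_db = {("ford", "f-150"): 2017, ("ford", "focus"): 2017}
available_years = {
    ("ford", "f-150"): [2016, 2017, 2018],
    ("ford", "focus"): [2016, 2017],
}

to_add = models_with_new_year(latest_in_db, available_years)
urls = [model_year_url(make, model, year)
        for (make, model), years in to_add.items()
        for year in years]
```

Only the models returned in `to_add` need a manual visit, which is where the time saving comes from.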
1093 models, 8184 model years, 51716 model trims – 22 August 2017
Full update done between 13-14 August and scraping 15-20 August.
1098 models, 8293 model years, 52607 model trims – 14 November 2017
Full update done between 6-7 November and scraping 8-12 November.
1116 models, 8386 model years, 53367 model trims – 18 March 2018
Full update done between 10-11 March and scraping 12-17 March.
Adding all missing cars and guaranteeing completeness in March 2018
My database still relied on the scraping done by the Singapore programmer in 2014, who, due to the very complex model hierarchy of the Edmunds website, missed many styles and captured some styles twice. At the November 2015 update I checked the duplicate cars, each of which indicates a missed car, and added the missed ones. I never guaranteed that I had included every car available on Edmunds; I was aware that there could be additional missing cars with no duplicate to indicate them.
New cars added since 2016 also had some missing styles, because I looked for newly launched model years and added the styles offered at the initial launch; at later updates I did not always recheck models already launched to see whether additional styles had been added.
To guarantee completeness I would have had to manually open each model, each year, etc. and check the URLs. I started this job manually in December 2015, but it took more than 10 hours just to get from Acura to Buick, so I had to stop.
After I mastered my scraping skills, and after the Pakistan programmer made a universal scraping software for me in 2016, I was able to check for missing cars more easily. But I didn’t hurry.
After the 18 March 2018 update with new cars, I started the check for missing models by tagging 1 trim/style of each model year and running them through the scraping software to extract the IDs of all trims/styles, then comparing the list of styles from the scraper with the list of trims/styles in my database, quickly spotting where something was missing so I could add it. This job took about 50 hours over 7 days. I added 623 missing trims/styles from 1990-2018 plus 40 styles of 2019 models launched in March 2018. In this way I also put the trims in the same order as they were displayed on Edmunds (with a few exceptions) and added nice borders: 1 pixel thick between years, 2 pixels between models, 3 pixels between makes.
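The comparison step described above is essentially a set difference per model year. Here is a minimal Python sketch of it, assuming both lists are already available as style IDs grouped by (make, model, year). The grouping, the function name, and the IDs are invented for illustration and are not real Edmunds IDs.

```python
def find_missing_styles(scraped, database):
    """Map each model year to the style IDs found by the scraper
    but absent from the database."""
    report = {}
    for model_year, style_ids in scraped.items():
        gap = set(style_ids) - set(database.get(model_year, ()))
        if gap:
            report[model_year] = sorted(gap)
    return report

# Made-up style IDs, grouped by (make, model, year).
scraped = {
    ("Acura", "ILX", 2017): [1001, 1002, 1003],
    ("Buick", "Regal", 2017): [2001, 2002],
    ("Ford", "Focus", 2018): [3001],
}
database = {
    ("Acura", "ILX", 2017): [1001, 1003],
    ("Buick", "Regal", 2017): [2001, 2002],
}

missing = find_missing_styles(scraped, database)
```

A model year that is absent from the database altogether shows up with all of its styles, so the same report covers both missing trims and missing model years.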
I was incredibly LUCKY to do this job between 20 and 26 March 2018, because on 30 March 2018 Edmunds removed the old individual pages for each trim/style, and due to the bug in the new layout I wouldn’t have been able to do this job afterwards.
54030 model trims – 1 April 2018 (5 2019 models).
A major change in April 2018, features columns removed
In late 2016 or early 2017 the Edmunds website introduced a new layout that displays specs for up to 3 trims at once and could not be scraped with my programmer’s universal scraping software. The old pages for each individual trim/style remained on Edmunds, and I continued to scrape data from them until they were removed on 30 March.
Now the only way to scrape data is from the JSON file of each style ID (example JSON). I paid $200 to the Pakistan programmer to make a JSON scraper that automatically creates a column for every label found. Whereas the old HTML pages of each style contained 20 category labels listing features in text form (so my database had 20 feature columns), the JSON file contains, besides specifications, a large number of features with true/false values, but the feature labels are not consistent between cars, so scraping all cars generated a database with a few thousand columns. I had to delete all the features and stick to the 53 columns of specifications.
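To illustrate why one column per label explodes when labels are inconsistent, here is a minimal Python sketch of a JSON-to-columns step. It is not the programmer’s actual scraper. The JSON layout, with a "specs" and a "features" section, and all keys and values are assumptions made up for the example; the real Edmunds JSON structure is not shown on this page.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into {"section.label": value}."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        elif isinstance(value, list):
            flat[name] = json.dumps(value)  # keep lists as JSON text
        else:
            flat[name] = value
    return flat

def build_table(records, keep_prefixes=None):
    """Create one column per label found, optionally keeping only
    the columns whose name starts with one of keep_prefixes."""
    rows = [flatten(r) for r in records]
    columns = {}  # a dict keeps first-seen order and ignores repeats
    for row in rows:
        for name in row:
            if keep_prefixes is None or name.startswith(tuple(keep_prefixes)):
                columns.setdefault(name, None)
    header = list(columns)
    table = [[row.get(name, "") for name in header] for row in rows]
    return header, table

# Made-up records: the same feature is labelled differently on each car.
records = [
    {"id": 101, "specs": {"horsepower": 160, "price": 18000},
     "features": {"Heated seats": True}},
    {"id": 102, "specs": {"horsepower": 200},
     "features": {"heated front seats": True}},
]

all_header, _ = build_table(records)
spec_header, spec_table = build_table(records, keep_prefixes=("id", "specs."))
```

With every label kept, each differently worded feature becomes its own mostly empty column. Restricting the table to the specification section keeps the column count stable, at the cost of dropping the features.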
The JSON file contains the trim and description in a different form from what was previously in my database. It does not contain year, make, or model, requiring me to add this info manually. It does contain the car price, allowing me to include price in my database for the first time (requested by many customers). However, the JSON file does not contain the image URL, so I am sorry for all customers who bought this database for car photos.
The new Edmunds layout contains several bugs. One of them: when you go to a car with multiple body styles, for example https://www.edmunds.com/ford/f-150/2018/ it shows only the trims of the Regular Cab, and when you select SuperCab or SuperCrew and then click Features, it switches back to Regular Cab. On the old layout pages there was a drop-down box showing the trims of all body styles. My programmer partner confirmed that it is NOT POSSIBLE to get the vehicle IDs of body styles other than the 1st.
I emailed Edmunds to report this issue, and it was fixed the next month. Surprisingly, the old individual pages for each trim/style reappeared on Edmunds, allowing me to return to the previous database format that includes photos and features, or maybe to a combination of the two formats.
1004 models, 8242 model years, 54486 model trims – 22 June 2018 (2019: 64 model years, 516 model trims).
1007 models, 8354 model years, 55337 model trims – 20 September 2018 (2019: 164 model years, 1343 model trims).
Features columns added back
Besides bringing back the old pages, Edmunds also added an anti-scraping feature that blocks the universal scraper made by my programmer partner. In October 2018 he made another scraper just for Edmunds, which allowed me to scrape the old HTML pages again, adding back the 20 feature columns, image URL, old-style Trim and Description, etc., expanding the database from 62 to 86 columns.
Do note that this is NOT future-proof: Edmunds can remove the old HTML pages again sooner or later. Also, scraping HTML pages is much slower than scraping JSON, so to serve customers who need these 24 extra columns I need to spend 7 days on each update instead of the 2 it would take to update only the original 62 columns.
1046 models, 8411 model years, 55762 model trims – 10 November 2018 (2019: 215 model years, 1707 model trims).
1057 models, 8505 model years, 56394 model trims – 5 February 2019.
I paid my programmer partner to make a 3rd scraper, for https://www.edmunds.com/car-maintenance/guide-page.html which allows me to get car IDs faster. This reduces the time required for each update and allows me to update more often. Also, instead of quarterly updates that add new models with their trims plus 1 yearly update that checks all old models for missing trims launched after the initial model launch, each quarterly update will now also add missing trims to old models.
1064 models, 8580 model years, 57169 model trims – 11 May 2019.
? models, ? model years, 57708 model trims – 6 August 2019. I started this update in July when I got 57703 model trims, but personal problems kept me away from business. I ran the scraper on 6 August and, after adding 539 rows on 7-8 August, making 57708 model trims, just out of curiosity I ran the scraper again on 8 August and found another 200 model trims. After spending 3 hours adding them, I ran the scraper a 3rd time and found ~10 more. It seems that my update overlapped with a major update on Edmunds, so I decided NOT to release the update to my customers and to wait ~7 days, then ran the scraper again on 15 August and found another ~100 cars.
1073 models, 8697 model years, 58044 model trims – 15 August 2019.
1038 models, 8740 model years, 58480 model trims – 22 September 2019.