Static timetables, stop locations, and route shape information in General Transit Feed Specification (GTFS) format for all operators, including regional, trackwork and transport routes not available in realtime feeds.
I’ve downloaded this data and started working with the train data specifically. However, I’m having trouble with the trip data. Two issues that I’d appreciate some advice on are:
1. What does trip_id mean? This is a composite field that seems to have route_id embedded within it. Also, it does not conform to the documentation referenced on the download page.
Some sample trip id’s are 1273.TA.2-SCO-sj2-14.91.R, 1.TA.2-SCO-sj2-14.1.H. In each case, 2-SCO-sj2-14 is the associated route_id.
One document defined trip_id as …
The trip_id is the unique identifier for a particular trip. It is
composed of two fields. The first is the run number of the
trip. The second value is a unix timestamp which indicates
the planned start of the trip.
Another defines it as
The trip_id used to uniquely identify trips has a semantic content that could be used to
provide additional information about the timetabled train. The format is as follows:
<trip_name>.<timetable_id>.<timetable_version_id>.<dop_ref>.<set_type>.<number_of_cars>.<trip_instance>
Neither of these seems to align with the trip_id values present in the data.
2. Duplicated trips. This may be a function of my issues with (1) above?
By way of example (there are others), the trips 1005.TA.2-SCO-sj2-14.67.R, 1006.TA.2-SCO-sj2-14.67.R, 1008.TA.2-SCO-sj2-14.67.R all run from Port Kembla to Thirroul on Wednesday departing at 04:49:01 and arriving at 05:22:00. These trips each have a different service id linking through to the calendar data. That said, the stops and timings are the same. This implies that there are multiple trains running at the same time between the same stations.
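One way to look for trips like these is to group trips by their full stop-and-time pattern. The Python sketch below is illustrative only: it assumes the stop_times.txt rows have been loaded as dictionaries (for example with csv.DictReader), the name find_identical_trips is hypothetical, and the stop_id values and the fourth trip in the sample are made-up placeholders rather than values from the bundle.

```python
from collections import defaultdict


def find_identical_trips(stop_time_rows):
    """Group trip_ids whose stops and times are exactly the same."""
    calls_by_trip = defaultdict(list)
    for row in stop_time_rows:
        calls_by_trip[row["trip_id"]].append(
            (int(row["stop_sequence"]), row["stop_id"],
             row["arrival_time"], row["departure_time"])
        )

    trips_by_pattern = defaultdict(list)
    for trip_id, calls in calls_by_trip.items():
        # Order by stop_sequence, then drop it so only stops and times are compared.
        pattern = tuple(call[1:] for call in sorted(calls))
        trips_by_pattern[pattern].append(trip_id)

    return [sorted(ids) for ids in trips_by_pattern.values() if len(ids) > 1]


def sample_rows(trip_id, depart, arrive):
    # stop_id values are placeholders, not real identifiers from the feed.
    return [
        {"trip_id": trip_id, "stop_sequence": "1", "stop_id": "STOP_A",
         "arrival_time": depart, "departure_time": depart},
        {"trip_id": trip_id, "stop_sequence": "2", "stop_id": "STOP_B",
         "arrival_time": arrive, "departure_time": arrive},
    ]


rows = []
for trip_id in ("1005.TA.2-SCO-sj2-14.67.R", "1006.TA.2-SCO-sj2-14.67.R",
                "1008.TA.2-SCO-sj2-14.67.R"):
    rows += sample_rows(trip_id, "04:49:01", "05:22:00")
rows += sample_rows("HYPOTHETICAL_TRIP", "06:00:00", "06:33:00")

duplicate_groups = find_identical_trips(rows)
print(duplicate_groups)
```

Trips grouped this way share stops and timings only; whether they really run on the same day still depends on the service_id each one links to.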
The trip_id is defined as: <trip_name>.<timetable_id>.<timetable_version_id>.<dop_ref>.<set_type>.<number_of_cars>.<trip_instance>
I believe the example you have given is a bit more complex because those are two vehicles that join together to provide the service. You need to use block_id to combine them into the one trip. Hopefully one of our developers can explain that a bit better than I can.
Thanks for your note. I remain uncertain. I had read the document you refer to before my initial post.
Regarding my 2nd question above, the sample three trips I quoted in the initial post all have block id equal to blank so I cannot see how using block id will resolve my observation of apparent duplicates.
I have found other cases where block id does sensibly connect different shorter trips into a combined longer trip. That is not the particular problem that motivated my initial post.
As to my first question above, it was in the document you referenced that I found one of the two different formats for trip id. I cannot see how the sample trip id’s I quoted in my initial post (which are extracted directly from the data) align with the format for trip id you posted (which is presented in the document you reference). The format you quote has 7 components separated by periods. The trip id’s I provided have only 5 components separated by periods.
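The component count is easy to confirm in code. This is a minimal Python sketch that simply splits on periods; split_trip_id is a hypothetical helper name, and the only inputs are the documented format string and the two sample IDs quoted earlier in the thread.

```python
# The documented format, with the field names used as stand-ins for values.
DOCUMENTED_FORMAT = ("<trip_name>.<timetable_id>.<timetable_version_id>."
                     "<dop_ref>.<set_type>.<number_of_cars>.<trip_instance>")


def split_trip_id(trip_id):
    """Split a trip_id into its period-separated components."""
    return trip_id.split(".")


documented_fields = split_trip_id(DOCUMENTED_FORMAT)

for sample in ("1273.TA.2-SCO-sj2-14.91.R", "1.TA.2-SCO-sj2-14.1.H"):
    parts = split_trip_id(sample)
    print(sample, "has", len(parts), "components; the documented format has",
          len(documented_fields))
    # In both samples the third component matches the route_id.
    print("  third component:", parts[2])
```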
The trip_id format varies depending on which GTFS bundle you’re using. In the Timetable Complete GTFS bundle, the trip_id is a concatenation of a route identifier and a bunch of internal identifiers that don’t mean much to consumers like us. You can check out the TransXChange release notes and get hints as to what the identifiers refer to, but in brief:
@alejandro.felman,
It appears that the Private Coach Services data is neither comprehensive nor complete.
Many coach services are not included, such as Firefly, Greyhound, Australia Wide, etc.
For agencies that are included in the GTFS complete file, some operating routes are missing. An obvious example is the Sydney to Canberra route, which Murrays Coaches (agency_id = “B079”) currently operates, as does Greyhound (not included).
Any reason for these routes/agencies being excluded?
Chinh
Thanks for your note. I’ve also had a look at the release notes you mention. I think I’ll just treat the trip id as a composite unique id for the moment.
As for duplicated trips, I’d already looked at the calendar and calendar_dates files before posting. No joy. Even with the extra information provided in those files, there are distinct trip_id’s which result in trains running on the same day, at the same time stopping at the same stations. Does not make sense.
Thanks @alejandro.felman,
I can see that Murrays has Canberra - Wollongong and Canberra - Narooma.
But the route Canberra - Sydney is missing. This is just one example of missing routes for an agency that is included in the complete GTFS.
Chinh
Hi @alejandro.felman and @david.phillips,
This is somewhat related to the “Feedback requested for enhancements to Complete GTFS” thread, but I am posting here since I have mainly played around with the complete GTFS posted on this page. I have also tested with the sample of the enhanced GTFS, and the problem described below still stands in that dataset.
I believe there is a problem with the Sydney Trains Network dataset, which raises concerns about data quality. I’ll describe it in as much detail as possible so that you can replicate the problem I found.
1. Merging the calendar file with the trips file and filtering the shape_id (or route_id) to the T4 line, I can see that the T4 line is served by 59 distinct service_id values in this GTFS bundle.
2. If I filter these 59 distinct service_id values to the regular Tuesday service (merge them with the calendar file and keep only rows where tuesday = 1), I get 23 service_id values, as per below.
3. I merge the results of point 2 above with the calendar_dates file and filter to date = “2019-04-23” (a Tuesday) to see which service_id values are actually scheduled to run on Tuesday 23 April 2019. Results below:
If we look at the last column (exception_type), the values are all 2, meaning that all 23 of these service_id values that serve the T4 are removed on that date, according to the GTFS bundle.
4. I then look at calendar_dates for the same date to see if other service_id values are added to serve the T4. The result indicates that some service_id values are added on 23 Apr 2019, but NONE of them actually serve the T4.
5. The GTFS data therefore tell me that no train is scheduled to run on the T4 on Tuesday 23 Apr 2019. I therefore conclude that we have a serious issue with data quality.
I share the R code just in case you want to replicate this problem in R
NOTE: If I look at a different date, such as 23 March 2019, at least I see some trips scheduled to serve the T4 line. Also, if I look at the same date, 23 Apr 2019, but a different network, such as Sydney Buses Network or TrainLink, I still see trips running on these networks. The problem described above also happens on other dates (e.g. 25 Apr) and on other train lines operated by Sydney Trains.
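For anyone who wants to repeat the date check without R, the same steps can be sketched in Python. This is a rough sketch under stated assumptions, not the code used above: rows are dictionaries as read by csv.DictReader, dates are in the raw GTFS YYYYMMDD form, the function names are hypothetical, and the service and route IDs in the sample data are invented to mimic the situation described (the regular Tuesday service is removed and the only added service runs on another route).

```python
from datetime import date

WEEKDAY_COLUMNS = ["monday", "tuesday", "wednesday", "thursday",
                   "friday", "saturday", "sunday"]


def parse_gtfs_date(value):
    """Convert a GTFS YYYYMMDD string into a date."""
    return date(int(value[:4]), int(value[4:6]), int(value[6:8]))


def services_active_on(day, calendar_rows, calendar_date_rows):
    """Apply calendar.txt first, then calendar_dates.txt exceptions."""
    weekday_column = WEEKDAY_COLUMNS[day.weekday()]  # Monday is 0
    active = set()
    for row in calendar_rows:
        in_range = (parse_gtfs_date(row["start_date"]) <= day
                    <= parse_gtfs_date(row["end_date"]))
        if in_range and row[weekday_column] == "1":
            active.add(row["service_id"])
    for row in calendar_date_rows:
        if parse_gtfs_date(row["date"]) != day:
            continue
        if row["exception_type"] == "1":    # service added on this date
            active.add(row["service_id"])
        elif row["exception_type"] == "2":  # service removed on this date
            active.discard(row["service_id"])
    return active


def services_running_on_route(day, route_id, trips_rows,
                              calendar_rows, calendar_date_rows):
    """Return the active service_ids that have a trip on route_id."""
    active = services_active_on(day, calendar_rows, calendar_date_rows)
    return {row["service_id"] for row in trips_rows
            if row["route_id"] == route_id and row["service_id"] in active}


def calendar_row(service_id, days, start, end):
    row = {"service_id": service_id, "start_date": start, "end_date": end}
    for column in WEEKDAY_COLUMNS:
        row[column] = "1" if column in days else "0"
    return row


# Invented sample data, not taken from the bundle.
calendar = [calendar_row("SVC_TUE", {"tuesday"}, "20190301", "20190531")]
calendar_dates = [
    {"service_id": "SVC_TUE", "date": "20190423", "exception_type": "2"},
    {"service_id": "SVC_EXTRA", "date": "20190423", "exception_type": "1"},
]
trips = [
    {"trip_id": "trip-1", "route_id": "T4", "service_id": "SVC_TUE"},
    {"trip_id": "trip-2", "route_id": "OTHER", "service_id": "SVC_EXTRA"},
]

on_23_april = services_running_on_route(date(2019, 4, 23), "T4", trips,
                                        calendar, calendar_dates)
on_16_april = services_running_on_route(date(2019, 4, 16), "T4", trips,
                                        calendar, calendar_dates)
print(on_23_april, on_16_april)
```

With this sample, the T4 has a service on 16 April but none on 23 April, which is the pattern reported above.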
Hi @chinhho, if you have a look at just the Sydney Trains data you will see that it only has data up until the 18th of April. I know our documentation says that the data export is based on a 90 day period but that is not always the case:
5.2 Start & End Dates As per Data Scope validity period, the export is based on a 90 day period. Many Start & End dates will reflect this period by being valid for the entire period. However there will be calendars with shorter validity periods that start in the future or end earlier. In general these will relate to change in schedules (e.g. a timetable amendment).
I believe best practice is to download the bundle daily and then work with the data one or two weeks in advance. Another developer might be able to confirm or let you know how they work with the data.
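A quick way to see how far ahead a set of services extends is to take the latest date on which any of them could run. The Python sketch below assumes rows loaded as dictionaries with raw GTFS YYYYMMDD dates; last_scheduled_day is a hypothetical helper, and the service IDs and dates in the sample are invented, with 18 April 2019 used as the end date only to mirror the case above.

```python
def last_scheduled_day(service_ids, calendar_rows, calendar_date_rows):
    """Return the latest YYYYMMDD date on which any given service could run.

    end_date is only an upper bound: a service may not operate on every
    day up to it. YYYYMMDD strings sort in date order, so max() works.
    """
    candidates = [row["end_date"] for row in calendar_rows
                  if row["service_id"] in service_ids]
    candidates += [row["date"] for row in calendar_date_rows
                   if row["service_id"] in service_ids
                   and row["exception_type"] == "1"]
    return max(candidates, default=None)


# Invented sample data, not taken from the bundle.
calendar = [
    {"service_id": "TRAIN_A", "start_date": "20190301", "end_date": "20190418"},
    {"service_id": "TRAIN_B", "start_date": "20190301", "end_date": "20190412"},
    {"service_id": "BUS_A", "start_date": "20190301", "end_date": "20190530"},
]
calendar_dates = [
    {"service_id": "TRAIN_B", "date": "20190415", "exception_type": "1"},
]

latest = last_scheduled_day({"TRAIN_A", "TRAIN_B"}, calendar, calendar_dates)
print(latest)
```

Any query for a date after the returned value will find no trips for those services, regardless of what the rest of the bundle covers.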
Hi, I’m noticing the same issue in the GTFS complete bundle as @k4werri in that I can see multiple trips that appear to be duplicates (i.e. trains running on the same day at the same time stopping at the same stations). It appears to occur only in the Sydney Trains data, and Block ID does not seem to explain the issue.
Below is a screenshot of a recently downloaded bundle. Yellow marks Saturday services and orange marks Sunday services. The second image shows that there are 6 services operating on two days at exactly the same time.
If I had to take a guess at the issue, I’d say that the bundle is including trips from a range of timetable versions rather than only the latest timetable. Do you think that this would be the issue?
Is there any way to know the latest timetable id / timetable version id for a GTFS bundle?
We’re using the GTFS complete bundle and don’t have issues with duplicate trips you’re describing.
Having a quick look at the screenshot you’ve provided, it looks like you have a mix of old and new trips in your database. Are you clearing out old data before importing new data? For example, I wouldn’t expect to see both 125C.1317.129.128.A.8.58697160 and 125C.1197.103.128.A.8.57758629 in the same GTFS bundle because they represent the same trip but in different versions of the bundle published over time.
For the GTFS complete bundle, you also cannot assume IDs are consistent over time. For example, service ID TA+r1341+2 from a bundle published today could mean something completely different to TA+r1341+2 from a bundle published tomorrow. In the current bundle, that service ID is a Thursday service rather than a Saturday service as per above.
Hi @jxeeno thanks for jumping in. I agree it does look like a mix of old and new trips in the database. The screenshot of data is from a bundle downloaded a few days ago and imported into a blank sqlite database so if there were old trips they were in the bundle already. I’ll try again with a fresh bundle today but I suspect the same issue will crop up.
From your experience with the data, is there any way to identify old/new trips if they were in the same bundle?
Good to know that service IDs are recycled too; I’ll keep that one in mind.
Hmm, it shouldn’t be possible for the bundle to have both old and new trips. If you’re certain this is a fresh import into a blank sqlite database, maybe those are for separate services after all.
The next thing to look at would be how are you handling calendar_dates exceptions. Are you correctly determining which service IDs operate on which days based on entries in both calendar.txt and calendar_dates.txt?
Hi @jxeeno, I think it’s user error after all. I tried a new bundle and thought the issue had disappeared. Then I dug deeper into the calendar exclusions of both the fresh and the old bundle and realised that, while there might be multiple seemingly overlapping services, the way the calendar exclusion dates are recorded knocks out the overlap (e.g. service A and B might be M-F with the same date range, but the calendar dates might exclude service A half that time and service B the other half, rendering them effectively unique).
Thanks for your help!
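The service A and service B case above can be checked by expanding each service into its actual operating dates and intersecting them. This Python sketch is only an illustration: operating_dates is a hypothetical helper, and the two-week date range and the exclusion dates are invented to match the example, not taken from the bundle.

```python
from datetime import date, timedelta

MON_TO_FRI = {0, 1, 2, 3, 4}  # date.weekday() values, Monday is 0


def operating_dates(start, end, weekdays, removed):
    """Expand a calendar.txt-style pattern into dates, minus removed ones."""
    days = set()
    current = start
    while current <= end:
        if current.weekday() in weekdays and current not in removed:
            days.add(current)
        current += timedelta(days=1)
    return days


start, end = date(2019, 4, 1), date(2019, 4, 12)  # two Monday-to-Friday weeks
first_week = {date(2019, 4, d) for d in range(1, 6)}
second_week = {date(2019, 4, d) for d in range(8, 13)}

# Service A is excluded in the second week, service B in the first.
service_a = operating_dates(start, end, MON_TO_FRI, removed=second_week)
service_b = operating_dates(start, end, MON_TO_FRI, removed=first_week)

overlap = service_a & service_b
print("days both services run:", len(overlap))
print("days covered by either:", len(service_a | service_b))
```

If the overlap is empty, trips on the two services with identical stops and times never run on the same day, so they are not duplicates in practice.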