How Do You Make a GeoIP Database?
Making a GeoIP Database
To make a GeoIP database, GeoIP database providers combine a variety of data sources with the goal of being able to provide a single answer for where each IP is located.
This guide is intended to help people understand how GeoIP data comes to be, not instruct you in the creation of your own.
Source: Regional Internet Registry Database
When an organization like an ISP or hosting company wants an IP address, they ask the appropriate "Regional Internet Registry". In North America that's ARIN. If the request meets the registry's requirements, ARIN will assign a block of IPs to them and record them as the owners of those IPs. You can look up the public information with the whois command:
whois 126.96.36.199
NetRange: 188.8.131.52 - 184.108.40.206
CIDR: 220.127.116.11/16
NetName: NETBLK-24-120-0-0
NetHandle: NET-24-120-0-0-1
Parent: NET24 (NET-24-0-0-0-0)
NetType: Direct Allocation
OriginAS: AS22773
Organization: Cox Communications Inc. (CXA)
RegDate: 2001-02-21
Updated: 2014-12-09
Comment: For legal requests/assistance please use the following contact information:
Comment:
Comment: Cox Subpoena Phone: 404-269-0100
Comment:
Comment: Cox Subpoena Info: http://www.cox.com/policy/leainformation/default.asp
Ref: https://whois.arin.net/rest/net/NET-24-120-0-0-1

OrgName: Cox Communications Inc.
OrgId: CXA
Address: 1400 Lake Hearn Dr.
City: Atlanta
StateProv: GA
PostalCode: 30319
Country: US
RegDate:
Updated: 2017-01-28
Comment: For legal requests/assistance please use the
Comment: following contact information:
Comment: Cox Subpoena Phone: 404-269-0100
Comment: Cox Subpoena Info: http://www.cox.com/policy/leainformation/default.asp
Ref: https://whois.arin.net/rest/org/CXA
That IP address (18.104.22.168) has been assigned to Cox Communications, and Cox Communications has an address in Atlanta, Georgia, USA. You can also see that every IP address in the listed NetRange, up to 22.214.171.124, belongs to Cox, so the database could start by assuming all those IPs are in Atlanta.
Extrapolating location from whois information works, with varying amounts of detail, for every IP out there, making it an easy place to start building a database. There are two problems with this start:
- ARIN's terms of service specifically forbid using whois as part of a commercial service or product:
You are specifically prohibited from using the Whois Service (i) as part of a commercial service or product, including the solicitation and servicing of your, or your employer's, customers, even if additional data not derived from the Whois Service is incorporated or (ii) for advertising, direct marketing, marketing research or similar purposes.
- That's my IP right now, and I'm not in Atlanta, Georgia: I'm in the Tropicana Hotel, Las Vegas, Nevada, almost 2,000 miles away.
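For illustration, the naive starting point described above (every IP in a registered block gets the registrant's address) could be sketched in Python roughly like this. The blocks and locations are made-up placeholders using documentation address ranges, not real registry data, and the names REGISTRY_BLOCKS and registry_guess are hypothetical.

```python
import ipaddress

# Hypothetical registry records: (assigned block, registrant's listed location).
# These use reserved documentation ranges, not real allocations.
REGISTRY_BLOCKS = [
    (ipaddress.ip_network("198.51.100.0/24"), "Atlanta, GA, US"),
    (ipaddress.ip_network("203.0.113.0/24"), "Toronto, ON, CA"),
]


def registry_guess(ip):
    """Return the registrant's location for the block containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for block, location in REGISTRY_BLOCKS:
        if addr in block:
            return location
    return None
```

A guess from this function only says where the owner is registered. As the second problem above shows, that can be thousands of miles from where the IP is actually in use.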
Instead of looking at whois data, you could look at the hostname for an IP address to try to deduce its location:
nslookup 126.96.36.199
Non-authoritative answer:
188.8.131.52.in-addr.arpa name = wsip-24-120-53-94.lv.lv.cox.net.
Since we know the right answer, it's easy to spot "LV" and say, oh, it's in Las Vegas. Other answers are less revealing. For example, where is this:
184.108.40.206.in-addr.arpa name = d226-114-3.home.cgocable.net.
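Hostname rules like the "LV" one could be expressed as a small table of patterns. This is only a sketch: the single rule below is the one deduced in the text, and the names HOSTNAME_RULES and hostname_guess are hypothetical. A real rule set would need one hand-checked entry per provider naming scheme.

```python
import re

# Hypothetical rule table: (pattern on the reverse-DNS name, approximate location).
# Only the Cox "lv" rule comes from the example above.
HOSTNAME_RULES = [
    (re.compile(r"\.lv\.cox\.net\.?$"), "Las Vegas, NV, US"),
]


def hostname_guess(hostname):
    """Return an approximate location for a hostname, or None if no rule matches."""
    name = hostname.lower()
    for pattern, location in HOSTNAME_RULES:
        if pattern.search(name):
            return location
    return None
```

With only that one rule, the Cox hostname resolves to Las Vegas, while the cgocable.net name returns None because nothing in the table describes it.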
One other network tool we can look at is traceroute:
traceroute 220.127.116.11
traceroute to 18.104.22.168 (22.214.171.124), 30 hops max, 60 byte packets
 1 126.96.36.199 (188.8.131.52) 2.166 ms 2.104 ms 2.083 ms
 2 10ge.tor-fr402-dis-1.peer1.net (184.108.40.206) 0.377 ms 0.379 ms 0.364 ms
 3 10ge.xe-0-0-0.chi-eqx-dis-1.peer1.net (220.127.116.11) 38.202 ms 38.235 ms 10ge.xe-0-2-0.chi-eqx-dis-1.peer1.net (18.104.22.168) 38.446 ms
 4 10ge-xe-0-1-1.dal-eqx-cor-1.peer1.net (22.214.171.124) 38.472 ms 38.462 ms 38.431 ms
 5 10ge-xe-0-0-0.dal-eqx-cor-2.peer1.net (126.96.36.199) 38.056 ms 10ge.xe-0-3-0.dal-eqx-cor-2.peer1.net (188.8.131.52) 38.021 ms 10ge-xe-0-0-0.dal-eqx-cor-2.peer1.net (184.108.40.206) 38.247 ms
 6 220.127.116.11 (18.104.22.168) 38.578 ms 38.509 ms 38.465 ms
 7 nwstdsrj01-ae1.0.rd.lv.cox.net (22.214.171.124) 80.414 ms 80.032 ms sestdsrj01-ae1.0.rd.lv.cox.net (126.96.36.199) 67.647 ms
 8 24-234-6-17.ptp.lvcm.net (188.8.131.52) 74.373 ms 184.108.40.206 (220.127.116.11) 80.662 ms 79.977 ms
 9 wsip-24-120-53-94.lv.lv.cox.net (18.104.22.168) 81.316 ms 81.466 ms 81.305 ms
The traceroute tool yields the same hostname we saw before on its last line, but if that last line hadn't been helpful, the previous hops might have been. The second- or third-to-last hop will often tell you a lot about the provider serving that IP.
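To look at those later hops in bulk, the traceroute text has to be turned into per-hop hostnames first. Here is a rough sketch that assumes the common Linux output layout shown above (hop number, then "name (address)" pairs followed by timings). The function name hop_hostnames is hypothetical, and other traceroute variants may need different patterns.

```python
import re

# A hop line starts with the hop number; the header line does not.
HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")
# Each responding router appears as "name (a.b.c.d)".
HOST_WITH_ADDR = re.compile(r"(\S+) \((\d{1,3}(?:\.\d{1,3}){3})\)")
# Routers without reverse DNS show a bare address where the name would be.
BARE_ADDRESS = re.compile(r"[\d.]+")


def hop_hostnames(traceroute_output):
    """Return a list of (hop_number, [hostnames]) in the order the hops appear."""
    hops = []
    for line in traceroute_output.splitlines():
        match = HOP_LINE.match(line)
        if not match:
            continue  # skip the header and any blank lines
        names = []
        for host, _addr in HOST_WITH_ADDR.findall(match.group(2)):
            if BARE_ADDRESS.fullmatch(host) or host in names:
                continue  # no usable name, or already recorded for this hop
            names.append(host)
        hops.append((int(match.group(1)), names))
    return hops
```

Taking the last two or three entries of the result gives the hostnames closest to the target, which are the ones worth feeding into hostname rules.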
Ping is a simple tool that tells you how long it takes to get a message to a remote server and get a response. Ping data is helpful when the time is very short, but less useful when it takes longer. The reasoning here is: if ping takes very little time, the server being pinged must be close. If ping takes a while, the server could be farther away OR the traffic could have been routed poorly. In one example, we've seen packets travelling between our Baltimore and New York datacenters first go through Paris, France, then London, UK, then over to New York.
For this to work well, you need to really know where a server is, or ideally several servers. I know exactly where our webserver is (I took the physical server to the building and put it on the rack), and I trust that our providers give us accurate locations (some lie; when they do, we find alternate providers).
According to some GeoIP providers, the IP pinged below is in Kuwait, but if I ping it from our Frankfurt location I see:
wproxy@frankfurt *** 15:06:22 *** ~ > ping 22.214.171.124
PING 126.96.36.199 (188.8.131.52) 56(84) bytes of data.
64 bytes from 184.108.40.206: icmp_seq=1 ttl=60 time=1.54 ms
64 bytes from 220.127.116.11: icmp_seq=2 ttl=60 time=1.08 ms
64 bytes from 18.104.22.168: icmp_seq=3 ttl=60 time=0.822 ms
64 bytes from 22.214.171.124: icmp_seq=4 ttl=60 time=0.934 ms
Light travels about 120.9 miles per millisecond through fibre, and it's about 2,500 miles from Frankfurt to Kuwait. A round trip over that distance would take at least 41 ms, so a reply in about 1 ms is not physically possible.
Ping is a great way to prove some of your datapoints wrong.
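The arithmetic behind that check is simple enough to sketch. The snippet below assumes the fibre speed quoted above and a perfectly straight cable, so it gives a generous upper bound on distance; the function names are hypothetical.

```python
# Speed of light through fibre, in miles per millisecond (figure from the text).
FIBRE_MILES_PER_MS = 120.9


def max_distance_miles(rtt_ms):
    """Farthest a server can be, given a round-trip time in milliseconds."""
    one_way_ms = rtt_ms / 2  # ping measures there AND back
    return one_way_ms * FIBRE_MILES_PER_MS


def is_physically_possible(rtt_ms, claimed_distance_miles):
    """False means the claimed location is ruled out by the speed of light."""
    return claimed_distance_miles <= max_distance_miles(rtt_ms)


# Best ping from Frankfurt in the output above was 0.822 ms:
# max_distance_miles(0.822) is about 49.7 miles, nowhere near 2,500.
```

Note the check only works in one direction. A short round trip rules out far-away locations, but a long one proves nothing, since the packets may simply have taken a bad route.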
Putting it all together
Building a database will involve combining all the data sources to form a single authoritative answer for each IP. A good start would be:
- License data from the five regional internet registries (AFRINIC, ARIN, APNIC, LACNIC, and RIPE NCC) to start with some data for all IPs
- Put real effort into a good UI that would allow someone to manually look at all the data we've generated for an IP, and make a determination on its location
- Start running some hostname checks on each block of IPs, and generate some rules to turn those into approximate locations
- Combine hostname data with traceroutes. It's common to use local airport codes in hostnames for big routers, which can help speed processing.
- Use servers with known locations to spot-test some purportedly nearby systems with ping
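To make the combining step a bit more concrete, here is one hypothetical way to reduce the sources to a single answer. The ordering is my assumption for the sketch, not a description of how any provider does it: a hostname-derived location beats the registry address, and anything ping has disproved is thrown out. The Evidence class and best_guess function are illustrative names only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Everything gathered for one IP (all fields are illustrative)."""
    registry_location: Optional[str] = None   # from licensed registry data
    hostname_location: Optional[str] = None   # from hostname/traceroute rules
    ping_ruled_out: frozenset = frozenset()   # locations ping showed impossible


def best_guess(evidence):
    """Pick the most specific surviving location, or None for manual review."""
    # Assumed priority: hostname hints are usually more local than the
    # registrant's office address, so try them first.
    for candidate in (evidence.hostname_location, evidence.registry_location):
        if candidate and candidate not in evidence.ping_ruled_out:
            return candidate
    return None
```

When every candidate has been ruled out, the function returns None, which is where the manual review UI from the list above would take over.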