90 Commits

Author SHA1 Message Date
Frej Drejhammar
f741bf39f2 bugfix: Avoid starting incremental conversions from scratch
Keys and values in the state cache are byte strings, therefore a
lookup of 'tip' will always fail. The failure makes the conversion
start over from the beginning, but as fast-export is deterministic the
results are the same; the conversion is just very inefficient. The bug has existed since
the port to Python 3.

This patch switches the 'tip' lookup to use a byte string which should
make incremental conversions restart at the last converted commit. As
'x' == b'x' in Python 2, this should be a backwards compatible change.

Bug reported and fix suggested by Tomas Kolda.

Fixes #258.
2021-02-19 16:47:53 +01:00
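The lookup failure described in f741bf39f2 can be reproduced with a plain dictionary. This is a minimal sketch, not the project's code: `state_cache` and `last_converted` are hypothetical names, and the cached value is an assumed example.

```python
# Hypothetical stand-in for the state cache: keys and values are byte strings.
state_cache = {b'tip': b'42'}


def last_converted(cache, key):
    """Return the cached revision as an int, or 0 to start from scratch."""
    return int(cache.get(key, 0))


# On Python 3 a str key never matches a bytes key, so the lookup misses.
start_with_str = last_converted(state_cache, 'tip')

# A bytes key finds the entry, so the conversion can resume.
start_with_bytes = last_converted(state_cache, b'tip')
```

On Python 2, `'tip'` and `b'tip'` are the same type, so both lookups hit.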
Frej Drejhammar
7057ce2c2b Allow plugins to modify the committer
Since they were introduced, plugins have been able to modify the author
of a commit, but not the committer. This patch adds the support needed
for them to modify the committer as well.
2020-09-30 17:47:33 +02:00
Ondrej Stanek
9c6dea9fd4 Pass original hg commit hash to plugins 2020-07-31 10:50:51 +02:00
Ethan Furman
5c1cbf82b0 Add revision to commit_data for commit plugins
Co-Authored-By: ostan89@gmail.com
2020-07-31 10:48:33 +02:00
Ondrej Stanek
50631c4b34 Add option --ignore-unnamed-heads
This option allows the user to ignore only unnamed heads (compared to --force
which ignores all non-fatal issues). The intended use is for a future plugin
converting unnamed heads to named branches.
2020-07-31 10:30:53 +02:00
Ethan Furman
2a9dd53d14 Show all unnamed heads at once
Co-Authored-By: ostan89@gmail.com
2020-07-31 10:27:07 +02:00
chrisjbillington
d29d30363b Fix backward incompatible change for hg < 5.1
The port to Python 3 in b961f146 changed `repo.branchmap().iteritems()`
to use `.items()` instead. However, the object returned by mercurial
isn't a dictionary and its `.items()` method was only introduced (as an
alias for `iteritems`) in hg 5.1. `iteritems()` still exists, so let's
keep using it for now to retain compatibility with hg < 5.1.
2020-05-06 11:59:49 -04:00
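The compatibility problem behind d29d30363b can be sketched without Mercurial installed. `OldBranchMap` below is a hypothetical stand-in, not a Mercurial class; it only imitates an object that offers `iteritems()` but no `items()`.

```python
class OldBranchMap(object):
    """Hypothetical stand-in for a pre-5.1 branchmap object: it is not a
    dict and only offers iteritems()."""

    def __init__(self, entries):
        self._entries = dict(entries)

    def iteritems(self):
        return iter(self._entries.items())


branchmap = OldBranchMap({b'default': [b'abc123']})

# Works on the stand-in, mirroring the call the commit restores.
branch_names = [name for name, heads in branchmap.iteritems()]

# .items() is missing on this stand-in, just as on the pre-5.1 object.
has_items = hasattr(branchmap, 'items')
```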
Frej Drejhammar
f102d2a69f Merge branch 'PR/223'
Closes #223
2020-05-06 16:31:13 +02:00
Ondrej Stanek
cf0e5837b6 Allow converting a repository with git and hg subrepos
In the verification phase, fast-export wrongly expects both hg
and git subrepositories to have a corresponding line in the
subrepo-map file. In fact, only hg subrepos need a line in
subrepo-map that references a converted subrepo; git
subrepositories do not.
2020-05-06 16:30:05 +02:00
chrisjbillington
3b3f86b71e Allow utf8 in mappings
We were previously processing entries in mapping files (when
`--mappings-are-raw` is not given) with
`.decode('unicode_escape').encode('utf8')` to replace backslash escape
sequences in bytestrings with the utf-8 encoded characters they
represent. However, it turns out that `.decode('unicode_escape')`
assumes latin-1 encoding if it encounters non-ascii
bytes: https://bugs.python.org/issue21331. So this gave incorrect
results if non-ascii utf8 data was present in the mapping.

To fix this, we now add an extra layer of
`.decode('utf8').encode('unicode-escape')` in order to convert any non-ascii characters into
their backslash escape sequences. Then the subsequent
`.decode('unicode_escape')` only encounters ascii characters and gives
correct results.
2020-03-25 12:33:42 -04:00
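The latin-1 behaviour described in 3b3f86b71e can be seen on a single non-ASCII character. This is a minimal sketch with assumed sample data; it covers only the non-ASCII case and does not show how backslash sequences already present in a mapping are handled.

```python
raw = u'\u00e9'.encode('utf8')  # b'\xc3\xa9', the UTF-8 encoding of e-acute

# unicode_escape treats the two non-ASCII bytes as latin-1 characters,
# so re-encoding produces mojibake instead of the original bytes.
naive = raw.decode('unicode_escape').encode('utf8')

# Escaping non-ASCII characters first leaves pure ASCII for the later decode.
escaped = raw.decode('utf8').encode('unicode_escape')
fixed = escaped.decode('unicode_escape').encode('utf8')
```

Here `fixed` equals `raw`, while `naive` does not.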
chrisjbillington
6361b44c33 Fix bug in ignoring .git files/folders on Windows
Mercurial internally stores (most) filepaths using forward slashes, and
returns them as such from its Python API, even on Windows.

So the splitting up of filepaths with `os.path.sep` was incorrect,
resulting in `.git` files (those within a subdirectory, anyway)
not being ignored on Windows as intended. Splitting on `b'/'` regardless
of OS fixes this.
2020-03-08 19:40:50 +01:00
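The separator issue fixed in 6361b44c33 can be sketched as follows. `has_git_component` is a hypothetical helper, and `ntpath` is used only to obtain the Windows separator on any platform.

```python
import ntpath  # Windows path rules, importable on any OS


def has_git_component(path):
    """True if any component of a forward-slash bytes path is named .git."""
    return b'.git' in path.split(b'/')


hg_path = b'vendor/.git/config'  # Mercurial-style path with forward slashes

# Splitting on the Windows separator leaves the path in one piece,
# so the .git component is never seen.
windows_sep = ntpath.sep.encode('ascii')
split_with_os_sep = hg_path.split(windows_sep)
```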
chrisjbillington
48508ee299 Fix failure to print error message in verify_heads
On Python 3, `b'%s' % None` fails with a TypeError. In verify_heads,
an error message prints the sha1 of a git commit, but that sha1
can be None.

This commit instead prints `b'<None>'` if sha1 is None.
2020-03-06 11:02:38 -05:00
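The failure mode described in 48508ee299 and the guard it adds can be sketched like this. `format_sha1` is a hypothetical helper name, not a function from the project.

```python
def format_sha1(sha1):
    """Format a possibly missing sha1 for a bytes error message."""
    return b'%s' % (sha1 if sha1 is not None else b'<None>')


# On Python 3, bytes %-formatting rejects None outright.
try:
    b'%s' % None
    raised = False
except TypeError:
    raised = True
```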
Max Fuqua
750fe6d3e1 Resolve type error resulting from passing an int to b'%s' in python3 2020-02-29 14:55:15 -05:00
chrisjbillington
4071f720b0 Fix issue #203: Resolve stderr encoding issues
In Python 3, `sys.stderr.write()` requires unicode strings, and all
output on standard streams is UTF8 encoded. Therefore in the port to
Python 3, we `.decode()`d all strings that are used in `%` formatting of
strings to be printed to stderr.

However, in Python 2, `sys.stderr` accepts either bytestrings or unicode
strings, and:

- `%s` formatting of a bytestring with a unicode string, i.e. `"%s" %
  u"foo"` results in a unicode string.
- Writing a unicode string to stderr/stdout uses that stream's encoding
- When the output of the process is being piped somewhere other than a
  terminal (as it is when called with pipes and shell redirection from
  hg-fast-export.sh), that encoding is None, which implies ASCII.
- This raises UnicodeEncodeError if the unicode strings passed to
  `stderr.write()` have non-ascii characters.

We cannot fix this problem simply by encoding UTF8 again before writing
to stderr on Python 2. This is because the *decoding* of filenames with
the UTF8 codec may fail - filenames may not even be valid UTF8 despite
this being the declared filesystem encoding.

We could `fsdecode()` filenames on Python 3, which would use the
`surrogateescape` error handler, but stderr does not use this error
handler for output, meaning we would just have to encode again (with the
same error handler) anyway. And Python 2 lacks the `surrogateescape`
error handler in any case - we would need to reimplement it just to do a
round-trip decode and encode for no reason.

This commit leaves filenames and other repository data as bytestrings,
and simply writes them to `sys.stderr.buffer` on Python 3 or
`sys.stderr` on Python 2 as-is, after `%` formatting with bytestring
literals. This avoids encoding issues of filenames altogether.

Other writing to stderr that does not involve repository data has been
left with "native" strings, i.e.
`sys.stderr.write("a string literal %s" % a_command_line_arg)`. These
will still fail on Python 3 if the user passes a non-UTF filename as a
command line argument or similar. This is acceptable IMHO - although
`hg-fast-export` may encounter invalid UTF8 in mercurial repositories,
it is not too much to impose that the user name their branch mapping
files etc with valid UTF8!
2020-02-19 12:18:00 -05:00
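The approach taken in 4071f720b0, formatting with bytestring literals and writing the result unmodified, can be sketched as below. `write_bytes_message` and the sample filename are assumptions for illustration; the `getattr` fallback is one possible way to pick `sys.stderr.buffer` on Python 3 and `sys.stderr` on Python 2.

```python
import io
import sys


def write_bytes_message(message, stream=None):
    """Write a bytes message as-is, without any decode/encode round trip."""
    if stream is None:
        # sys.stderr.buffer exists on Python 3; fall back to sys.stderr.
        stream = getattr(sys.stderr, 'buffer', sys.stderr)
    stream.write(message)


filename = b'caf\xe9.txt'  # latin-1 bytes, not valid UTF-8
message = b'Warning: skipping %s\n' % filename

# The filename cannot be decoded as UTF-8, so decoding for output would fail.
try:
    filename.decode('utf8')
    decodes = True
except UnicodeDecodeError:
    decodes = False

# Capture the output in memory here instead of writing to the real stderr.
captured = io.BytesIO()
write_bytes_message(message, captured)
```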
chrisjbillington
b961f146df Support Python 3
Port hg-fast-export to Python 2/3 polyglot code.

Since mercurial accepts and returns bytestrings for all repository data,
the approach I've taken here is to use bytestrings throughout the
hg-fast-export code. All strings pertaining to repository data are
bytestrings. This means the code is using the same string datatype for
this data on Python 3 as it did (and still does) on Python 2.

Repository data coming from subprocess calls to git, or read from files,
is also left as the bytestrings either returned from
subprocess.check_output or as read from the file in 'rb' mode.

Regexes and string literals that are used with repository data have
all had a b'' prefix added.

When repository data is used in error/warning messages, it is decoded
with the UTF8 codec for printing.

With this patch, hg-fast-export.py writes binary output to
sys.stdout.buffer on Python 3 - on Python 2 this doesn't exist and it
still uses sys.stdout.

The only strings that are left as "native" strings and not coerced to
bytestrings are filepaths passed in on the command line, and dictionary
keys for internal data structures used by hg-fast-export.py, that do
not originate in repository data.

Mapping files are read in 'rb' mode, and thus bytestrings are read from
them. When an encoding is given, their contents are decoded with that
encoding, but then immediately encoded again with UTF8 and they are
returned as the resulting bytestrings.

Other necessary changes were:

 - indexing bytestrings with a single index returns an integer on Python 3.
   These indexing operations have been replaced with a one-element
   slice: x[0] -> x[0:1] or x[-1] -> x[-1:] so as to return a bytestring.

 - raw_hash.encode('hex_codec') replaced with binascii.hexlify(raw_hash)

 - str(integer) -> b'%d' % integer

 - 'string_escape' codec replaced with 'unicode_escape' (which was
    backported to python 2.7). Strings decoded with this codec were then
    immediately re-encoded with UTF8.

 - Calls to map() intended to execute their contents immediately were
   unwrapped or converted to list comprehensions, since on Python 3 map()
   returns a lazy iterator and does not execute until iterated over.

hg-fast-export.sh has been modified to not require Python 2. Instead, if
PYTHON has not been defined, it checks python2, python, then python3,
and uses the first one that exists and can import the mercurial module.
2020-02-13 14:35:19 -05:00
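Several of the polyglot idioms listed in b961f146df can be shown side by side. This is a sketch with assumed sample values, and the comments describe Python 3 behaviour.

```python
import binascii

data = b'/default'

# A single index yields an int on Python 3; one-element slices yield bytes.
first_index = data[0]
first_slice = data[0:1]
last_slice = data[-1:]

# binascii.hexlify replaces raw_hash.encode('hex_codec').
raw_hash = b'\x00\xff\x10'
hex_hash = binascii.hexlify(raw_hash)

# b'%d' replaces str(integer) when a bytestring is needed.
rev_text = b'%d' % 42

# map() is lazy on Python 3: nothing runs until the result is iterated.
seen = []
lazy = map(seen.append, [1, 2, 3])
seen_before = list(seen)

# An explicit loop (or a list comprehension) executes immediately.
for n in [1, 2, 3]:
    seen.append(n)
```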
Frej Drejhammar
595587b245 Merge branch 'PR/197'
Closes #197, #185, #196
2020-02-09 19:39:21 +01:00
Matthijs van der Burgh
0b6b83c3de Adapt to status becoming an object in Mercurial 5.3
Status has always been a tuple, but since Mercurial 5.3 (commit
https://www.mercurial-scm.org/repo/hg/rev/c5548b0b6847) it is an object.
Therefore the __getitem__ of the tuple isn't available anymore.

This fix is compatible with mercurial>=4.6, as the old status tuple
still has the same properties.
2020-02-08 17:23:30 +01:00
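The idea behind 0b6b83c3de, reading status fields by name rather than by position, can be sketched with two stand-ins. `TupleStatus` and `ObjectStatus` are hypothetical and are not Mercurial classes; the field names `modified`, `added` and `removed` are assumed to match the properties the commit message refers to.

```python
from collections import namedtuple

# Hypothetical stand-ins; neither is a real Mercurial class.
TupleStatus = namedtuple('TupleStatus', ['modified', 'added', 'removed'])


class ObjectStatus(object):
    def __init__(self, modified, added, removed):
        self.modified = modified
        self.added = added
        self.removed = removed


def changed_files(status):
    """Read the fields by name, which works for both shapes."""
    return status.modified, status.added, status.removed


old_style = TupleStatus([b'a.txt'], [b'b.txt'], [])
new_style = ObjectStatus([b'a.txt'], [b'b.txt'], [])

# Positional access only works on the tuple-like stand-in.
try:
    new_style[0]
    index_works = True
except TypeError:
    index_works = False
```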
chrisjbillington
8d135fe700 Ignore files and directories called .git
Git cannot track these files. Print a warning if encountering one.

Fixes #166
2020-02-07 17:52:57 -05:00
MokhamedDakhraui
9c9669d361 Check .hgsub and .hgsubstate files to detect subrepo changes 2020-01-26 00:36:34 +03:00
Dave Townsend
ab31fdcbaa Add support for git submodules
Mercurial supports not only submodules which are Mercurial
repositories, but also Git and Subversion repositories. This
patch adds support for submodules which are Git repositories to
hg-fast-export.

As submodules which are Git repositories won't need a mapping
file, we trigger the submodule update only on the occurrence of the
`.hgsubstate` file and push the check for a valid
`submodule_mappings` to `refresh_gitmodules(ctx)`.
2019-12-07 10:22:23 -08:00
Dave Townsend
acf93a80a9 Only export submodules that exist in the submodule mapping. 2019-12-07 10:21:26 -08:00
Dave Townsend
0f49bfe0db Move hg sub-module updating to its own function, NFC
This refactoring is in preparation for supporting Mercurial
submodules which are git repositories.
2019-12-07 09:39:43 -08:00
Dave Townsend
ff1c885305 Ignore obsolete changesets in the source repository
Obsolete changesets are, for example, created by the Evolve
extension. This patch switches to an unfiltered repository (the
filtered one throws on an attempt to access obsolete revisions) and
then filters out the obsolete revisions when it comes across them.

Fixes #173
2019-10-20 19:45:42 +02:00
Frej Drejhammar
0096085b6f Tag maps should use the same syntax as branch and author maps
When version v171002 introduced a new mapping file format for branches
and authors, that change never made it to the remapping of tags
although the README documents it.

Fixes #172.
2019-10-12 21:09:14 +02:00
Frej Drejhammar
1181a0af47 Allow name sanitizer to be disabled with --no-auto-sanitize
Make it possible to completely disable the name sanitizer by the
--no-auto-sanitize flag. Previously the sanitizer was run on user
remapped names. As the sanitizer rewrites perfectly legal git
names (such as __.*) this is probably not what the user wants.

Closes #155.
2019-09-13 14:56:32 +02:00
MokhamedDakhraui
581b1b3d17 Remove git submodules if .hgsubstate file was removed or emptied 2019-08-18 05:46:46 +03:00
MokhamedDakhraui
7df01ac323 Refactor refresh_gitmodules()
Use the change context substate field instead of manually parsing the `.hgsubstate` file.
2019-08-16 02:42:03 +03:00
MokhamedDakhraui
914f5a0dbe Replaced several lambdas by one loop 2019-08-16 02:41:54 +03:00
MokhamedDakhraui
8779cb5e95 Extract operations with submodules to separated methods 2019-08-16 02:40:44 +03:00
Johannes Carlsson
47d330de83 Add support for mercurial subrepos
This adds a new command line option (--subrepo-map) that will
map mercurial subrepos to git submodules.

The --subrepo-map takes a mapping file as an argument that will
be used to map a subrepo folder to a git submodule.

For more information see the README-SUBMODULES.md.

This commit is inspired by the changes made by daolis in PR#38
that was never merged.

Closes: #51
Closes: #147
2019-01-07 18:41:19 +01:00
Johan Henkens
cadcfcbe90 Move filter_contents to plugin system 2018-12-05 13:25:48 -08:00
Johan Henkens
e895ce087f Add plugin system 2018-12-05 13:25:47 -08:00
Frej Drejhammar
ac60034ba3 Adhere to PEP 394
From PEP 394 [1]:

* python2 will refer to some version of Python 2.x.

* end users should be aware that python refers to python3 on at least
  Arch Linux (that change is what prompted the creation of this PEP),
  so python should be used in the shebang line only for scripts that
  are source compatible with both Python 2 and 3.

So to make sure that we run correctly on a system where python refers
to python3 and avoid problems like issue #11 we change the shebangs.

[1] https://www.python.org/dev/peps/pep-0394/
2018-08-11 15:07:19 +02:00
Anton Tykhyy
89db1d93cf Add --filter-contents 2018-06-17 21:09:59 +03:00
Frej Drejhammar
e200cec39f Adapt to changes in Mercurial 4.6
Starting with Mercurial 4.6 repo.lookup() no longer accepts raw hashes
for lookups.
2018-06-10 15:51:09 +02:00
Frej Drejhammar
50dc10770b Warn contributors against doing work that will not be merged
From time to time contributors spend time doing work that will not be
accepted as it duplicates functionality that is already provided with
the mapping files. Try to dissuade them from doing that by explaining
the reasons in the comment.
2018-02-01 07:03:03 +01:00
Frej Drejhammar
cc8fefe008 Change syntax of mapping files
This is done to allow escape sequences in the key and value strings.
2017-10-02 13:05:14 +02:00
Frej Drejhammar
e174c2a0b7 Refactor load_mapping() to move line parsing to inner function
This is done in preparation for allowing mappings to contain quoted
characters.
2017-09-29 18:50:41 +02:00
Frej Drejhammar
4bb50bb3fb Fix crash when a branch name starts with '/'
If a branch name starts with '/' it will be split into ['', ...] and
then mapped over with dot(), only dot() does not handle the empty
string. Teach dot() to handle the empty string.

This fixes the underlying problem in issue #91.
2017-05-14 14:32:59 +02:00
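The crash fixed in 4bb50bb3fb comes from the empty first component that splitting produces. In the sketch below, the body of `dot()` is an assumption (it is taken to replace a leading dot with an underscore; the commit message does not spell it out), and bytes are used to match the later Python 3 port.

```python
def dot(name):
    """Assumed behaviour: replace a leading dot, leave everything else."""
    if not name:
        # The empty component from a leading '/' is passed through untouched.
        return name
    if name[0:1] == b'.':
        return b'_' + name[1:]
    return name


# A branch name starting with '/' splits into an empty first component.
parts = b'/feature/.hidden'.split(b'/')
cleaned = b'/'.join(dot(p) for p in parts)
```

Without the emptiness check, a version of `dot()` that inspects `name[0]` would raise `IndexError` on the empty component.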
Frej Drejhammar
c614ae776b Fix "Branch ... modified outside hg-fast-export..." for sanitized branch names
The heads cache contains sanitized names, but we try to look up
unsanitized names, which is wrong. Switch to looking up the sanitized
name.
2016-04-15 15:46:47 +02:00
Frej Drejhammar
7224e420a7 Warn if one of the marks, mapping, or heads files is empty 2016-04-03 15:48:03 +02:00
Frej Drejhammar
b7cc6ab3bf verify_heads() needs to be aware of the branch renaming map
As all branches created on the git side are transformed by
sanitize_name(), this should be a safe backwards compatible change. If a
user is doing incremental imports and sanitize_name() now suddenly
modifies the branch name, verify_heads() would already have complained
on the first incremental run.

Thanks go to Steve Tousignant <s.tousignant@gmail.com> for discovering
the problem.
2016-04-02 15:01:45 +02:00
Frej Drejhammar
6d8b4dbb11 Warn if opening a mapping file fails 2016-04-02 14:59:47 +02:00
Frej Drejhammar
832ee29bfa Refactor sanitize_name() to know about renaming map
Handle the lookup table for branch and tag renaming inside
sanitize_name().
2016-04-02 14:57:19 +02:00
Frej Drejhammar
46bf316a3c Explain why it is a bad idea to change sanitize_name()
This is a piece of code which frequently attracts pull requests which
are summarily rejected. As there is no "git blame" for rejected pull
requests, try to avoid misguided work by adding a comment at the
relevant place.
2016-04-02 12:28:03 +02:00
Frej Drejhammar
f75057e49a Make --hg-hash work in incremental mode
When an import is restarted the first new note commit must use
refs/notes/hg^0 as the parent. As refs/notes/hg is only updated at the
end of a session we cannot have it present in all note commits. Neither
can we generate new marks for note commits as that would require a new
mapping scheme from hg version numbers to git marks. A new mapping
scheme would break existing incremental import setups.

We therefore restructure the code to do the notes at the end of an
import session, thus only requiring a refs/notes/hg^0 reference in the
first commit.
2016-01-10 14:00:02 +01:00
Han Sangjin
38e81367ec Add filename encoding option --fe
In some locales Mercurial uses different encodings for commit messages
and file names. The --fe option allows the filename encoding to be
overridden.
2015-11-13 11:39:47 +01:00
Frej Drejhammar
3c27c693e1 Allow branches and tags to be remapped
Branch and tag names can now be renamed using a mechanism similar to the
-A option for author names.

-B specifies a mapping file for branch names, and -T a mapping file for
tags.
2015-08-16 17:13:04 +02:00
Frej Drejhammar
a542b6aa97 refactor: Make author map loading more generic
This is the first step in adding mappings for branches and tags.
2015-08-16 13:09:51 +02:00
Frej Drejhammar
b9b6f2a57a Survive corrupt source repositories
Apparently a bug (http://bz.selenic.com/show_bug.cgi?id=3511) in
multiple released versions of Mercurial could produce commits where
files had absolute paths.

As a "healthy" repo should not contain any absolute paths, it should be
safe to always strip a leading '/' from the path and let the conversion
continue.
2015-08-15 20:26:02 +02:00
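The workaround described in b9b6f2a57a amounts to dropping one leading slash from a path. This is a minimal sketch; `strip_leading_slash` is a hypothetical helper name, and bytes paths are assumed.

```python
def strip_leading_slash(path):
    """Turn an absolute path from a corrupt commit into a relative one."""
    return path[1:] if path.startswith(b'/') else path


absolute = strip_leading_slash(b'/src/main.c')
relative = strip_leading_slash(b'src/main.c')
```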